Abstract

Functional genomics networks are widely used to identify unexpected pathway relationships in large genomic datasets. However, it is challenging to compare the signal-to-noise ratios of different networks and to identify the optimal network with which to interpret a particular genetic dataset. We present GeNets, a platform in which users can train a machine-learning model (Quack) to carry out these comparisons and execute, store, and share analyses of genetic and RNA-sequencing datasets.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

This work was supported in part by the US National Institutes of Health (NHLBI grants HHSN268201000033C and R01HL096738 to S. Carr; NCI Clinical Proteomics Tumor Analysis Consortium initiative grant U24CA160034 to S. Carr; grants 1R01MH109903, U01-DK078616, and 5P01HD068250-07 to K.L., A.K., T. Li, and H.H.), the Executive Committee On Research at Massachusetts General Hospital (Fund for Medical Discovery Award to H.H.), the MGH IRG American Cancer Society (H.H. and K.L.), the Stanley Center at the Broad Institute (grant to K.L., A.K., T. Li, and H.H.), the Broad Institute (Broadnext10 grant to K.L., A.K., T. Li, and H.H.), the Lundbeck Foundation (Large Thematic Project Grant to K.L., A.K., T. Li, and H.H.), and the Simons Foundation (SFARI; Research Award to K.L., A.K., T. Li, and H.H.).

Author information

Author notes

  1. These authors contributed equally: Taibo Li and April Kim.

Affiliations

  1. Department of Surgery, Massachusetts General Hospital, Boston, MA, USA

    Taibo Li, April Kim, Heiko Horn, Liraz Greenfeld & Kasper Lage

  2. Broad Institute of MIT and Harvard, Cambridge, MA, USA

    Taibo Li, April Kim, Joseph Rosenbluh, Heiko Horn, Liraz Greenfeld, David An, Andrew Zimmer, Arthur Liberzon, Jon Bistline, Ted Natoli, Yang Li, Aviad Tsherniak, Rajiv Narayan, Aravind Subramanian, Ted Liefeld, Bang Wong, Dawn Thompson, Sarah Calvo, Steve Carr, Jesse Boehm, Jake Jaffe, Jill Mesirov, Nir Hacohen, Aviv Regev & Kasper Lage

  3. Department of Electrical Engineering & Computer Science, MIT, Cambridge, MA, USA

    Taibo Li

  4. Johns Hopkins School of Medicine, Baltimore, MD, USA

    Taibo Li

  5. Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA

    Joseph Rosenbluh

  6. Howard Hughes Medical Institute and Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA

    Yang Li & Sarah Calvo

  7. Department of Statistics, Harvard University, Cambridge, MA, USA

    Yang Li

  8. Department of Medicine, University of California, San Diego, San Diego, CA, USA

    Jill Mesirov

  9. Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, Boston, MA, USA

    Nir Hacohen

  10. Department of Surgery, Harvard Medical School, Boston, MA, USA

    Nir Hacohen & Kasper Lage

  11. Howard Hughes Medical Institute, Department of Biology, MIT, Cambridge, MA, USA

    Aviv Regev

  12. Institute for Biological Psychiatry, Mental Health Center Sct. Hans, University of Copenhagen, Roskilde, Denmark

    Kasper Lage

Contributions

T. Li, A.K., H.H., L.G., D.A., A.Z., J. Bistline, B.W., A.R., and K.L. developed the GeNets platform. T. Li, A.K., J.R., H.H., L.G., D.A., A.Z., A.L., J. Bistline, T.N., Y.L., A.T., R.N., A.S., T. Liefeld, B.W., D.T., S. Carr, S. Calvo, J. Boehm, J.J., J.M., N.H., A.R., and K.L. analyzed data and performed experiments. T. Li and K.L. wrote the paper with input from all other authors. K.L. initiated, designed, and led the project.

Competing interests

The authors declare no competing interests.

Corresponding author

Correspondence to Kasper Lage.

Integrated supplementary information

  1. Supplementary Figure 1 Differential pathway topologies across functional genomics networks.

    a) For a given pathway, we measure its topological properties, exemplified here with the 22 genes of the AKT pathway in the InWeb protein-protein interaction network. We make the same measurements for all genes in the AKT pathway context set (grey squares), in this case 2,449 genes (only 2 of which are shown for illustration) that have at least one connection to an AKT gene in InWeb. The distributions for 4 of 18 topological properties are shown and illustrate the differences between pathway (dark blue) and context (light grey) distributions. b) This procedure is repeated for 853 non-redundant pathways in the InWeb network. The distributions of the broader population show that genes in a common pathway have a topological signature that distinguishes them from context genes. c) Repeating the procedure detailed in b) for the other four networks shows this is a general principle. d) When quantified and compared, it is clear that each network has a unique distribution of topological metrics [colors as indicated in panels b/c]. In all panels the x-axis denotes the respective metrics and the y-axis is the relative frequency (density) of observations. We use the following abbreviations: interaction (int.), member (mbr), distribution (dist.), weighted (Wt.), pathway (P), overall network (N); e.g. Eigenvector (P) denotes the Eigenvector centrality in the pathway.

  2. Supplementary Figure 2 Comparing distributions of topological metrics between pathway members and context genes using InWeb.

    Using the 853 pathways, we compute each metric for pathway proteins based on their protein interactions (with other proteins of the same pathway) and individually compute the same metric for a maximum of 1,500 of the context proteins for each pathway based on how each context protein interacts with the pathway proteins. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (blue denotes the distribution for pathway members and grey the context genes).

  3. Supplementary Figure 3 Comparing distributions of topological metrics between pathway members and context genes using CLIMENet.

    Using the 853 pathways, we compute each metric for pathway genes based on their phylogenetic similarity (with other genes of the same pathway) and individually compute the same metric for a maximum of 1,500 of the context genes for each pathway based on how each context gene is connected to the pathway genes. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (orange denotes the distribution for pathway members and grey the context genes).

  4. Supplementary Figure 4 Comparing distributions of topological metrics between pathway members and context genes using GEONet.

    Using the 853 pathways, we compute each metric for pathway genes based on their correlation in expression (with other genes of the same pathway) and individually compute the same metric for a maximum of 1,500 of the context genes for each pathway based on how each context gene is correlated to the pathway genes. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (purple denotes the distribution for pathway members and grey the context genes).

  5. Supplementary Figure 5 Comparing distributions of topological metrics between pathway members and context genes using LINCSNet.

    Using the 853 pathways, we compute each metric for pathway genes based on their cell perturbation profiles (with other genes of the same pathway) and individually compute the same metric for a maximum of 1,500 of the context genes for each pathway based on how each context gene is connected to the pathway genes. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (green denotes the distribution for pathway members and grey the context genes).

  6. Supplementary Figure 6 Comparing distributions of topological metrics between pathway members and context genes using AchillesNet.

    Using the 853 pathways, we compute each metric for pathway genes based on their cancer codependencies (with other genes of the same pathway) and individually compute the same metric for a maximum of 1,500 of the context genes for each pathway based on how each context gene is codependent to the pathway genes. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (red denotes the distribution for pathway members and grey the context genes).

  7. Supplementary Figure 7 Differential topological signatures emerge across networks.

    Using the 853 pathways, we compute each metric for pathway genes (within each pathway) based on the connections defined by the respective networks. The x-axis is the metric as indicated (on the log scale to facilitate visualization) and the y-axis is the scaled density (blue: InWeb, red: AchillesNet, green: LINCSNet, purple: GEONet, orange: CLIMENet).

  8. Supplementary Figure 8 Classification performance (AUC) across networks, spurious edge-removal methods, and network sizes.

    For each functional data set, we thresholded the top positive connections using the original data, network deconvoluted data, and globally silenced data and selected 5 thresholds: 500K (a), 750K (b), 1M (c), 1.25M (d), and 1.5M (e). For each network, method, and threshold, we train and test the performance of Quack using a 70/30 split on the 853 pathways (N = 597 pathways for training and N = 256 pathways for testing). AUCs are computed based on holdout data and empirical confidence intervals are computed for each of the classifiers by bootstrapping trees from each forest. Center line, median; error bars, 2.5th and 97.5th percentiles.

  9. Supplementary Figure 9 Sensitivity of classification of significantly connected pathways across networks, spurious edge-removal methods, and network sizes.

    For each functional data set, we thresholded the top positive connections using the original data, network deconvoluted data, and globally silenced data and selected 5 thresholds: 500K (a), 750K (b), 1M (c), 1.25M (d), and 1.5M (e). For each network, method, and threshold, we compute the density (# edges / # possible edges) for each of the 853 pathways based on the connections found in the respective pathways. A null distribution for the density metric is also computed based on N = 250 randomly sampled gene sets of similar size and degree distribution as the pathway under consideration, from which we compute an empirical p-value for each pathway. The sensitivity is computed by assessing how many of the 853 pathways were deemed significantly connected at an alpha=0.05 significance level. Center line, median; error bars, 2.5th and 97.5th percentiles.

  10. Supplementary Figure 10 Building a general classifier to predict pathway membership from networks.

    a) For a given pathway, we measure its topological properties exemplified here with the 21 genes of the AKT pathway in the InWeb protein-protein interaction network. In the matrix, the 18 topological properties are shown as columns and the corresponding values for each of the 21 genes in the AKT pathway (black circles) as rows (metric values correspond to colors as indicated in the figure legend). One row in this matrix corresponds to one row in the final modeling dataset. We make the same measurements for genes in the context of the AKT pathway (white squares); only 2 of 2,449 context genes shown in the illustration. b) This procedure is repeated for 853 pathways from which the modeling dataset used to train the classifier is derived. For any candidate gene in a network, the classifier can assign a probability that it belongs to a pathway (e.g., the AKT pathway) as defined by the candidates’ topological properties in the overall network and in relation to a specific set of genes (e.g., the 21 AKT genes).

  11. Supplementary Figure 11 True positive rates by probability decile across all five networks.

    For each network, we score the 30% holdout of 853 pathways (N = 256 pathways) and their context after training and testing the respective classifiers. For each network, we use the classifier-assigned probabilities (assigned to pathways and their contexts) and compute deciles of the predicted probability distribution. Here, the decrease in true positive rate (# of pathway members / all genes in the decile) in lower deciles further illustrates the predictive power of the classifiers and the consistency between the predicted probability and the true positive rate. The number of pathway members (N_p) and context genes (N_c) considered in the 30% holdout set for each network are as follows: AchillesNet (N_p = 1,323; N_c = 202,532), GEONet (N_p = 3,676; N_c = 240,465), InWeb (N_p = 6,584; N_c = 220,077), CLIMENet (N_p = 1,482; N_c = 141,998), LINCSNet (N_p = 2,279; N_c = 227,554).

  12. Supplementary Figure 12 GeNets nominated potential genes implicated in autism spectrum disorders on the basis of pathway relationships.

    a) From the 31 candidate genes discovered in Main Text Fig. 2d, we annotated genes in genome-wide significant schizophrenia loci with orange and genes in which de novo mutations have been found in neurodevelopmental delay with purple. b) Genes under brain-specific regulation are also annotated (large nodes correspond to genes that have brain-specific eQTLs). Network layouts are identical in panels a, b and Main Text Fig. 2d, allowing gene names to be inferred.

  13. Supplementary Figure 13 Ranking the importance of pathway topological metrics across networks.

    By permuting the values of each topological metric being evaluated, it is possible to estimate the overall importance of each topological metric across networks. Here the topological properties are in descending order by their average rank across networks. The y-axis is the rank (1-18), where 18 is the most important metric for distinguishing pathway members and 1 is the least important. We use the following abbreviations: weighted (Wt.), pathway (P), overall network (N), and local clustering coefficient (LCC), so that LCC (P) and LCC (N) mean the local clustering coefficient in the pathway and network, respectively. Closeness and Eigenvector centrality are consistently important across networks (columns 18 and 17, respectively), while there is significant variation in the predictive power of, e.g., the local clustering coefficient in the network [LCC (N), column 8]. We also observe that some metrics, such as the degree in the pathway (column 1), are less important in all networks when controlling for other topological metrics.

  14. Supplementary Figure 14 Visualizing differential pathway topologies across networks.

    We plotted the network-specific eigenvector centralities of genes in the PDGF pathway (n = 121 genes), ERBB1 downstream pathway (n = 105 genes), and E2F pathway (n = 74 genes), indicated by row. Large nodes denote high values and small nodes denote low values with respect to a specific pathway across networks. Only pathway members that have network information in one of the networks are shown. To enable a straightforward visual comparison, we pooled all five networks and laid out the genes in each pathway based on this one meta-network. Edges connecting genes in a given pathway correspond to the network indicated by the column. Although non-pathway genes have been omitted for clarity, the pathways are embedded in very complex network-specific neighborhoods involving thousands (ranging from 1,386 to 4,208) of context genes. While the eigenvector centrality is generally high for pathway members across all networks, we also observe considerable divergence in the gene-specific patterns and strengths of these values, and in the patterns of connections between pathway sets.

  15. Supplementary Figure 15 A comparison of Quack with SANTA and GeneMANIA.

    a) For each of N = 45 neural pathways presented in Main Text Fig. 2a, we randomly masked 30% of pathway genes and asked each of Quack, SANTA, and GeneMANIA to distinguish holdout genes based on their relationship with the 70% seed genes. We used InWeb for Quack and SANTA, and the default network for GeneMANIA. AUCs were calculated based on method-specific scores and pathway membership, and plotted as distributions for each method. b) For each of the N = 853 canonical MSigDB pathways presented in Main Text Fig. 1c, we repeated the same analyses for Quack and SANTA and plotted AUC distributions. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points, outliers.
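
The connectivity test described for Supplementary Figure 9 (pathway density compared against N = 250 random gene sets, giving an empirical p-value) can be illustrated with a minimal Python sketch. This is not the GeNets implementation. The function names, the toy network, and the +1 pseudocount in the p-value are assumptions made for illustration, and the random sets here are matched on size only, whereas the caption also matches on degree distribution.

```python
import random
from itertools import combinations


def density(genes, edges):
    """Return # edges / # possible edges among the given genes."""
    genes = list(genes)
    n = len(genes)
    if n < 2:
        return 0.0
    possible = n * (n - 1) // 2
    observed = sum(
        1 for a, b in combinations(genes, 2) if frozenset((a, b)) in edges
    )
    return observed / possible


def empirical_p_value(pathway, all_genes, edges, n_random=250, seed=0):
    """Compare pathway density with size-matched random gene sets.

    Assumption: degree matching is omitted for brevity, and a +1
    pseudocount keeps the estimate away from exactly zero.
    """
    rng = random.Random(seed)
    observed = density(pathway, edges)
    null = [
        density(rng.sample(all_genes, len(pathway)), edges)
        for _ in range(n_random)
    ]
    exceed = sum(1 for d in null if d >= observed)
    return (exceed + 1) / (n_random + 1)


# Hypothetical toy network: 30 genes, one fully connected 5-gene "pathway",
# plus a few unrelated edges elsewhere in the network.
all_genes = [f"g{i}" for i in range(30)]
pathway = all_genes[:5]
edges = {frozenset(pair) for pair in combinations(pathway, 2)}
edges |= {frozenset(("g10", "g11")), frozenset(("g12", "g20"))}

pathway_density = density(pathway, edges)
p_value = empirical_p_value(pathway, all_genes, edges)
significant = p_value < 0.05  # alpha = 0.05, as in the caption
```

Under this sketch, sensitivity would be the fraction of the 853 pathways for which `significant` is true. Matching random sets on degree distribution, as the caption specifies, would require sampling genes from degree bins rather than uniformly.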

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–15, Supplementary Tables 1 and 2 and Supplementary Notes 1–8

  2. Reporting Summary

  3. Supplementary Data 1

    853 curated canonical pathways from the Molecular Signatures Database

  4. Supplementary Software 1

    GeNets source code and example data

  5. Source Data, Fig. 1

About this article

DOI

https://doi.org/10.1038/s41592-018-0039-6