Abstract

A key unmet challenge in interpreting omics experiments is inferring biological meaning in the context of public functional genomics data. We developed a computational framework, Your Evidence Tailored Integration (YETI; http://yeti.princeton.edu/), which creates specialized functional interaction maps from large public datasets relevant to an individual omics experiment. Using this tailored integration, we predicted and experimentally confirmed an unexpected divergence in viral replication after seasonal or pandemic human influenza virus infection.

Data availability

The virus infection microarray data are available in GEO under accession GSE55278. Researchers may submit their data of interest for YETI analysis at http://yeti.princeton.edu/, where they can also visualize and explore the resulting YETI network as well as precomputed YETI networks. All data used in this study are available from the corresponding author on request.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

We thank R. Dannenfelser for help in processing the TCGA RNA-seq datasets and A. Krishnan for discussions regarding the network evaluations. We greatly appreciate all members of the Troyanskaya lab for their valuable advice and discussions. This work was supported in part by the NIH (grant NIH U19 AI117873 to S.C.S.; grant NIH R01 GM071966 to O.G.T.). O.G.T. is a senior fellow of the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR).

Author information

Author notes

    • Young-suk Lee

    Present address: School of Biological Sciences, Seoul National University, Seoul, Korea

Affiliations

  1. Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA

    Young-suk Lee, Alicja Tadych & Olga G. Troyanskaya

  2. Department of Computer Science, Princeton University, Princeton, NJ, USA

    Young-suk Lee & Olga G. Troyanskaya

  3. Flatiron Institute, Simons Foundation, New York, NY, USA

    Aaron K. Wong, Christopher Y. Park & Olga G. Troyanskaya

  4. Department of Neurology and Center for Advanced Research on Diagnostic Assays, Icahn School of Medicine at Mount Sinai, New York, NY, USA

    Boris M. Hartmann, Elena Zaslavsky & Stuart C. Sealfon

  5. Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

    Veronica A. DeJesus & Irene Ramos

Authors

  1. Young-suk Lee

  2. Aaron K. Wong

  3. Alicja Tadych

  4. Boris M. Hartmann

  5. Christopher Y. Park

  6. Veronica A. DeJesus

  7. Irene Ramos

  8. Elena Zaslavsky

  9. Stuart C. Sealfon

  10. Olga G. Troyanskaya

Contributions

Y.-s.L., E.Z., S.C.S., and O.G.T. conceived and designed the research. Y.-s.L. performed the computational analyses with contributions from C.Y.P. A.K.W., A.T., and Y.-s.L. developed the web interface. B.M.H., V.A.D., and I.R. performed the molecular experiments. Y.-s.L., E.Z., S.C.S., and O.G.T. wrote the manuscript with revisions from all other authors.

Competing interests

The authors declare no competing interests.

Corresponding authors

Correspondence to Elena Zaslavsky or Stuart C. Sealfon.

Integrated supplementary information

  1. Supplementary Figure 1 Overview of other approaches.

    (left) Coexpression networks are based entirely on correlated expression within a specific dataset. This yields functional relationships that are highly relevant to that dataset but does not accurately capture biological pathway interactions. (right) A generic Bayesian integration using a global data compendium accurately identifies biological pathway interactions, but the functional relationships in this network have low specificity to any individual dataset.

  2. Supplementary Figure 2 YETI computational framework to construct dataset-specific functional networks.

    Known functional interactions are categorized into 237 distinctive biological processes spanning the multifaceted interaction landscape of the human genome. The context-specific interaction network of each biological process is learned through Bayesian integration of the public data compendium. From these 237 Bayesian functional networks (i.e., source networks), those with interaction patterns similar to the user's dataset of interest are then selected.

  3. Supplementary Figure 3 Example work-through of the YETI webserver.

    The user first submits her omics dataset of interest in a simple tab-separated values (TSV) format with genes in rows and omics assays in columns. The web server then integrates the public data compendium in accordance with the latent data structure of the input dataset. The user can then explore the dataset-relevant source networks and the dataset-specific functional map of query genes to gain deeper insight into the omics dataset used as input.

  4. Supplementary Figure 4 Effect of exclusion of a single dataset from generic or YETI integrations.

    Distributions of the Dataset Specificity Score are shown for generic integration including the dataset, generic integration excluding it, and YETI. The center line represents the median, the lower and upper hinges indicate the first and third quartiles, the upper whisker extends to the largest value no further than 1.5 × IQR above the upper hinge, and the lower whisker extends to the smallest value no further than 1.5 × IQR below the lower hinge. Ten of the 362 GEO datasets used for evaluation in Fig. 2 were chosen at random to be excluded from or included in the generic integrations, which were evaluated over the MeSH terms relevant to each dataset (see Supplemental Online Methods). YETI achieved significantly improved dataset specificity over generic integration (**p = 3.1 × 10⁻⁴, one-tailed paired t test), and including the dataset of interest in the generic integration had no significant effect on specificity (p = 0.87, one-tailed paired t test). N.S. = not significant.

  5. Supplementary Figure 5 Evaluation of YETI network performance robustness to the number of directly relevant datasets in the data compendium.

    Accuracy scores of YETI networks from disease datasets were grouped by the number of datasets annotated to the disease, excluding the user dataset used for YETI analysis. Boxplots were drawn as in Supplementary Fig. 4. The sample sizes of the boxplots from left to right are: 52, 36, 64, 35, and 249.

  6. Supplementary Figure 6 Evaluation of vulnerability of the density of YETI networks and co-expression networks to dataset size.

    (a) Network densities of co-expression networks exponentially decreased with greater dataset size. (b) Network densities of YETI networks were consistently low even across input datasets of different sizes.
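
    The network density compared in Supplementary Figure 6 can be illustrated with a minimal sketch. It assumes the standard definition for an undirected network, density = 2m / (n(n - 1)) for n genes and m edges, and builds a toy co-expression network by thresholding the absolute Pearson correlation. The 0.9 cutoff, the function names, and the toy expression values are illustrative assumptions, not the procedure used in the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(vx * vy)


def coexpression_edges(expr, threshold=0.9):
    """Return gene pairs whose |Pearson r| meets a hypothetical cutoff.

    expr maps gene name -> list of expression values (one per sample).
    """
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expr), 2)
        if abs(pearson(expr[g1], expr[g2])) >= threshold
    ]


def network_density(n_nodes, n_edges):
    """Fraction of possible undirected edges present: 2m / (n(n - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy data (made up for illustration): four genes measured in five samples.
expr = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with A
    "C": [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with A
    "D": [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with A, B and C
}
edges = coexpression_edges(expr)
density = network_density(len(expr), len(edges))
print(edges, density)
```

    On this toy input, three of the six possible gene pairs pass the cutoff, giving a density of 0.5.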

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–6

  2. Reporting Summary

  3. Supplementary Data 1

    Name and description of datasets included in the public data compendium

  4. Supplementary Data 2

    ID and name of the 237 source networks

About this article

DOI

https://doi.org/10.1038/s41592-018-0218-5
