Article

Similarity network fusion for aggregating data types on a genomic scale

Nature Methods volume 11, pages 333–337 (2014)

Abstract

Recent technologies have made it cost-effective to collect diverse types of genome-wide data. Computational methods are needed to combine these data to create a comprehensive view of a given disease or a biological process. Similarity network fusion (SNF) solves this problem by constructing networks of samples (e.g., patients) for each available data type and then efficiently fusing these into one network that represents the full spectrum of underlying data. For example, to create a comprehensive view of a disease given a cohort of patients, SNF computes and fuses patient similarity networks obtained from each of their data types separately, taking advantage of the complementarity in the data. We used SNF to combine mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets. SNF substantially outperforms single data type analysis and established integrative approaches when identifying cancer subtypes and is effective for predicting survival.
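The idea of building one sample-similarity network per data type and then fusing them can be illustrated with a minimal sketch. This is not the authors' Matlab or R implementation. The abstract does not state the update rule, so the Gaussian affinity, the k-nearest-neighbour local kernel, the cross-diffusion style update, and the names `affinity`, `knn_kernel`, `fuse`, `sigma`, `k` and `iterations` are all assumptions made for illustration.

```python
import numpy as np

def affinity(X, sigma=1.0):
    """Gaussian affinity between samples (rows of X). sigma is an assumed bandwidth."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    scale = d2.mean() if d2.mean() > 0 else 1.0  # rescale so views are comparable
    return np.exp(-d2 / (2.0 * sigma ** 2 * scale))

def row_normalize(W):
    """Make every row sum to 1."""
    return W / W.sum(axis=1, keepdims=True)

def knn_kernel(W, k):
    """Keep each sample's k strongest similarities (self included), then row-normalize."""
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return row_normalize(S)

def fuse(views, k=3, iterations=10):
    """Fuse per-view sample networks into one network (illustrative update rule)."""
    if len(views) < 2:
        raise ValueError("need at least two data types to fuse")
    W = [affinity(X) for X in views]
    P = [row_normalize(w) for w in W]   # full similarity network per view
    S = [knn_kernel(w, k) for w in W]   # sparse local network per view
    for _ in range(iterations):
        P_new = []
        for v in range(len(views)):
            others = [P[u] for u in range(len(views)) if u != v]
            avg = sum(others) / len(others)
            # Diffuse the other views' similarities through this view's local network.
            Pv = S[v] @ avg @ S[v].T
            P_new.append(row_normalize((Pv + Pv.T) / 2.0))
        P = P_new
    fused = sum(P) / len(P)
    return (fused + fused.T) / 2.0      # symmetric fused network

# Toy usage: six "patients" in two groups, measured with two synthetic data types.
rng = np.random.default_rng(0)
groups = np.repeat([0.0, 5.0], 3)[:, None]
view_a = groups + 0.1 * rng.standard_normal((6, 4))
view_b = groups + 0.1 * rng.standard_normal((6, 3))
fused_network = fuse([view_a, view_b], k=3, iterations=5)
```

In this toy setting the fused network assigns higher similarity to pairs of samples from the same group than to pairs from different groups. A clustering method could then be run on the fused matrix to propose subtypes.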

Acknowledgements

This study used data generated by TCGA and METABRIC; we thank TCGA, the Cancer Research UK and the British Columbia Cancer Agency Branch for sharing these invaluable data with the scientific community. We thank N. Jabado, M. Wilson and J. Rommens for feedback on the manuscript, and B. Sousa for help with the figures. This study was partially funded by the Government of Canada through Genome Canada and the Ontario Genomics Institute (OGI-068) to M.B.; A.G. is funded by the SickKids Research Institute. Z.T. was supported by NSF IIS-1360568.

Author information

Author notes

    Bo Wang & Benjamin Haibe-Kains

    Present addresses: Department of Computer Science, Stanford University, Stanford, California, USA (B.W.) and Ontario Cancer Institute, Princess Margaret Cancer Centre—University Health Network, Toronto, Ontario, Canada (B.H.-K.).

Affiliations

  1. Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario, Canada.

    Bo Wang, Aziz M Mezlini, Feyyaz Demir, Michael Brudno & Anna Goldenberg

  2. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.

    Aziz M Mezlini, Feyyaz Demir, Marc Fiume, Michael Brudno & Anna Goldenberg

  3. Department of Cognitive Science, University of California San Diego, San Diego, California, USA.

    Zhuowen Tu

  4. Institut de Recherches Cliniques de Montréal, Université de Montréal, Montréal, Quebec, Canada.

    Benjamin Haibe-Kains

Authors

  1. Bo Wang

  2. Aziz M Mezlini

  3. Feyyaz Demir

  4. Marc Fiume

  5. Zhuowen Tu

  6. Michael Brudno

  7. Benjamin Haibe-Kains

  8. Anna Goldenberg

Contributions

B.W. and A.G. conceived of and designed the approach. B.W. performed the data analysis, implemented the method in Matlab and performed all computational experiments. A.M.M. performed data preparation. F.D. wrote the R code that is distributed with the paper. M.F. assisted with network visualization and analysis. Z.T. helped with method design and theoretical framework. B.H.-K. assisted in preparation and analysis of the METABRIC data. B.W., M.B. and A.G. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Anna Goldenberg.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–20, Supplementary Table 1, Supplementary Notes 1–3 and Supplementary Results

Zip files

  1. Supplementary Software

    Similarity Network Fusion for aggregating multiple data types

  2. Supplementary Data

    TCGA cancer datasets after pre-processing

About this article

DOI

https://doi.org/10.1038/nmeth.2810