Abstract
Diverse, high-dimensional modalities collected in large cohorts present new opportunities for the formulation and testing of integrative scientific hypotheses. Similarity-driven multi-view linear reconstruction (SiMLR) is an algorithm that exploits inter-modality relationships to transform large scientific datasets into smaller, better-powered and more interpretable low-dimensional spaces. SiMLR contributes an objective function to identify joint signal, regularization based on sparse matrices representing prior within-modality relationships, and an implementation that permits application to joint reduction of large data matrices. We demonstrate that SiMLR outperforms closely related methods on supervised learning problems in simulation data, a multi-omics cancer survival prediction dataset and multiple modality neuroimaging datasets. Taken together, this collection of results shows that SiMLR may be applied to joint signal estimation from disparate modalities and may yield practically useful results in a variety of application domains.
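To make the idea of joint reduction concrete, the following is a minimal Python sketch of a cross-modality linear reconstruction loop. It is an illustration under stated assumptions, not the ANTsR implementation: the function names (`simulate_views`, `sparsify`, `joint_embedding_sketch`), the simulated data, the SVD-based basis and the quantile-based sparsity rule are all hypothetical choices made here, and the prior within-modality regularization matrices described above are omitted entirely.

```python
import numpy as np


def simulate_views(n=200, k=2, dims=(30, 40, 50), noise=0.5, seed=0):
    """Simulate views that share a k-dimensional latent signal (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal((n, k))
    views = []
    for p in dims:
        loadings = rng.standard_normal((k, p))
        x = latent @ loadings + noise * rng.standard_normal((n, p))
        views.append(x - x.mean(axis=0))  # column-centre each view
    return latent, views


def sparsify(v, keep=0.5):
    """Zero all but the largest-magnitude fraction `keep` of entries in each column."""
    cutoff = np.quantile(np.abs(v), 1.0 - keep, axis=0)
    return np.where(np.abs(v) >= cutoff, v, 0.0)


def joint_embedding_sketch(views, k=2, keep=0.5, iterations=20, seed=0):
    """Alternately reconstruct each view from a basis built from the other views.

    Assumption: this is a simplified stand-in for the idea in the abstract,
    with no prior-based regularization and a plain SVD basis.
    """
    rng = np.random.default_rng(seed)
    feats = [rng.standard_normal((x.shape[1], k)) for x in views]
    for _ in range(iterations):
        for i, x in enumerate(views):
            # Low-dimensional projections of every view except view i.
            others = np.hstack(
                [views[j] @ feats[j] for j in range(len(views)) if j != i]
            )
            # Orthonormal basis (n x k) summarising the other views.
            u = np.linalg.svd(others, full_matrices=False)[0][:, :k]
            # With orthonormal u, x.T @ u minimises ||x - u v^T||_F over v.
            v = sparsify(x.T @ u, keep)
            norms = np.linalg.norm(v, axis=0)
            norms[norms == 0] = 1.0
            feats[i] = v / norms
    embeddings = [x @ v for x, v in zip(views, feats)]
    return feats, embeddings


if __name__ == "__main__":
    latent, views = simulate_views()
    feats, embeddings = joint_embedding_sketch(views)
    for x, e in zip(views, embeddings):
        print(x.shape, "->", e.shape)
```

Each view is reduced to `k` columns whose sparse feature weights are driven by what the other views can predict, which is the sense in which the embedding is similarity-driven in this sketch. The low-dimensional embeddings could then be passed to a downstream supervised model.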
Data availability
All visualized plots in the main manuscript are generated from our code capsule, which contains both the specific data sources and software calls necessary to reproduce the figures (ref. 80).
The simulation data are built dynamically in R. The scripts that generate the data are publicly available in our code capsule (ref. 80).

We downloaded evaluation data from the multi-omic cancer benchmark (ref. 47) website at http://acgt.cs.tau.ac.il/multi_omic_benchmark/download.html. Data are available in our code capsule (ref. 80) along with the relevant statistical details and calls needed to reproduce the results reported here. The data are free to use with no restrictions.

The brain age data used here were obtained from PTBP (ref. 81). These data were originally downloaded from https://figshare.com/articles/dataset/The_Pediatric_Template_of_Brain_Perfusion_PTBP_/923555. The relevant subset is available in our code capsule (ref. 80). The data are free to use with no restrictions.

Supplementary data used here were obtained from the PING study database (https://chd.ucsd.edu/research/ping-study.html). PING requires a user to register and request data. The review of the request may also require institutional support and justification of data use. We originally gained access to these data in 2013 as part of the PING-in-a-box service, which is now defunct.

Data used here were also obtained from the ADNI database (http://adni.loni.usc.edu). ADNI was launched in 2003 as a public-private partnership, led by M. W. Weiner. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. For up-to-date information, see http://adni.loni.usc.edu. The investigators within ADNI contributed to the design and implementation of ADNI and/or provided data, but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Authorship_List.pdf. ADNI requires a user to register and request data. The review of the request may also require institutional support and justification of data use. We originally gained access to these data in 2008. The version used in the Supplementary Information was downloaded in August 2020 from LONI.
Code availability
ANTsR is open source and freely available at https://github.com/ANTsX/ANTsR. The development of the code available on GitHub is ongoing. The specific release version of the code and the scripts used for the analysis and generation of figures in the main body of this manuscript are available in our code capsule (ref. 80).
Change history
04 March 2021
A Correction to this paper has been published: https://doi.org/10.1038/s43588-021-00049-4
References
Cole, J. H., Marioni, R. E., Harris, S. E. & Deary, I. J. Brain age and other bodily ‘ages’: implications for neuropsychiatry. Mol. Psychiatry 24, 266–281 (2019).
Wray, N. R. et al. Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat. Genet. 50, 668–681 (2018).
Habeck, C., Stern, Y. & Alzheimer’s Disease Neuroimaging Initiative. Multivariate data analysis for neuroimaging data: overview and application to Alzheimer’s disease. Cell Biochem. Biophys. 58, 53–67 (2010).
Shamy, J. L. et al. Volumetric correlates of spatiotemporal working and recognition memory impairment in aged rhesus monkeys. Cereb. Cortex 21, 1559–1573 (2011).
McKeown, M. J. et al. Analysis of fMRI data by blind separation into independent spatial components. Hum. Brain Mapp. 6, 160–188 (1998).
Calhoun, V. D., Adali, T., Pearlson, G. D. & Pekar, J. J. A method for making group inferences from functional MRI data using independent component analysis. Hum. Brain Mapp. 14, 140–151 (2001).
Calhoun, V. D., Liu, J. & Adali, T. A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data. Neuroimage 45, S163–S172 (2009).
Avants, B. B., Cook, P. A., Ungar, L., Gee, J. C. & Grossman, M. Dementia induces correlated reductions in white matter integrity and cortical thickness: a multivariate neuroimaging study with sparse canonical correlation analysis. Neuroimage 50, 1004–1016 (2010).
de Pierrefeu, A. et al. Structured sparse principal components analysis with the TV-elastic net penalty. IEEE Trans. Med. Imaging 37, 396–407 (2018).
Du, L. et al. Structured sparse canonical correlation analysis for brain imaging genetics: an improved GraphNet method. Bioinformatics 32, 1544–1551 (2016).
Avants, B. et al. Sparse unbiased analysis of anatomical variance in longitudinal imaging. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Jiang, T. et al.) 324–331 (Springer, 2010).
Avants, B. B. et al. Sparse canonical correlation analysis relates network-level atrophy to multivariate cognitive measures in a neurodegenerative population. Neuroimage 84, 698–711 (2014).
Du, L. et al. in Brain Informatics and Health (eds Guo, Y. et al.) 275–284 (Springer, 2015).
Guigui, N. et al. Network regularization in imaging genetics improves prediction performances and model interpretability on Alzheimer’s disease. In Proc. IEEE 16th International Symposium on Biomedical Imaging. 1403–1406 (IEEE, 2019).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Chalise, P. & Fridley, B. L. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLoS ONE 12, e0176278 (2017).
Dhillon, P. et al. Subject-specific functional parcellation via Prior Based Eigenanatomy. Neuroimage 99, 14–27 (2014).
Tikhonov, A. N. On the stability of inverse problems. Dokl. Akad. Nauk SSSR 39, 195–198 (1943).
Bell, J. B. Solutions of ill-posed problems. Math. Comput. 32, 1320–1322 (1978).
Smilde, A. K., Westerhuis, J. A. & de Jong, S. A framework for sequential multiblock component methods. J. Chemom. 17, 323–337 (2003).
Tenenhaus, A. & Tenenhaus, M. Regularized generalized canonical correlation analysis. Psychometrika 76, 257–284 (2011).
Tenenhaus, M., Tenenhaus, A. & Groenen, P. J. Regularized generalized canonical correlation analysis: a framework for sequential multiblock component methods. Psychometrika 82, 737–777 (2017).
Zhan, Z., Ma, Z. & Peng, W. Biomedical data analysis based on multi-view intact space learning with geodesic similarity preserving. Neural Processing Lett. 49, 1381–1398 (2019).
Baltrušaitis, T., Ahuja, C. & Morency, L. P. Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443 (2018).
Kettenring, J. R. Canonical analysis of several sets of variables. Biometrika 58, 433–451 (1971).
Tenenhaus, A. et al. Variable selection for generalized canonical correlation analysis. Biostatistics 15, 569–583 (2014).
Rohart, F., Gautier, B., Singh, A. & Lê Cao, K.-A. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).
Garali, I. et al. A strategy for multimodal data integration: application to biomarkers identification in spinocerebellar ataxia. Brief. Bioinform. 19, 1356–1369 (2017).
Gloaguen, A. et al. Multiway generalized canonical correlation analysis. Biostatistics https://doi.org/10.1093/biostatistics/kxaa010 (2020).
Hotelling, H. The most predictable criterion. J. Educ. Psychol. 26, 139–142 (1935).
Hotelling, H. Relations between two sets of variants. Biometrika 28, 321–377 (1936).
Lock, E. F., Hoadley, K. A., Marron, J. S. & Nobel, A. B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7, 523–542 (2013).
Yu, Q., Risk, B. B., Zhang, K. & Marron, J. S. JIVE integration of imaging and behavioral data. Neuroimage 152, 38–49 (2017).
Ceulemans, E., Wilderjans, T. F., Kiers, H. A. & Timmerman, M. E. MultiLevel simultaneous component analysis: a computational shortcut and software package. Behav. Res. Methods 48, 1008–1020 (2016).
Argelaguet, R. et al. Multi-omics factor analysis–a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).
Carmichael, I. et al. Joint and individual analysis of breast cancer histologic images and genomic covariates. Preprint at https://arxiv.org/abs/1912.00434 (2019).
McMillan, C. T. et al. White matter imaging helps dissociate tau from TDP-43 in frontotemporal lobar degeneration. J. Neurol. Neurosurg. Psychiatry 84, 949–955 (2013).
McMillan, C. T. et al. Genetic and neuroanatomic associations in sporadic frontotemporal lobar degeneration. Neurobiol. Aging 35, 1473–1482 (2014).
Cook, P. A. et al. Relating brain anatomy and cognitive ability using a multivariate multimodal framework. Neuroimage 99, 477–486 (2014).
Hyvärinen, A. & Oja, E. Independent component analysis: a tutorial. In Notes for International Joint Conference on Neural Networks (IJCNN, 1999).
Hyvärinen, A. & Oja, E. Independent component analysis: algorithms and applications. Neural Networks 13, 411–430 (2000).
Haykin, S. & Chen, Z. The cocktail party problem. Neural Comput. 17, 1875–1902 (2005).
Andersen, P. K. & Gill, R. D. Cox’s regression model for counting processes: a large sample study. Ann. Stat. 10, 1100–1120 (1982).
Fox, J. & Weisberg, S. An R Companion to Applied Regression 2nd edn (Sage, 2011).
Huang, L. et al. Development and validation of a prognostic model to predict the prognosis of patients who underwent chemotherapy and resection of pancreatic adenocarcinoma: a large international population-based cohort study. BMC Med. 17, 1–16 (2019).
Neums, L., Meier, R., Koestler, D. C. & Thompson, J. A. Improving survival prediction using a novel feature selection and feature reduction framework based on the integration of clinical and molecular data. Pac. Symp. Biocomput. 25, 415–426 (2020).
Rappoport, N. & Shamir, R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 46, 10546–10562 (2018).
Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
Yong, W.-S., Hsu, F.-M. & Chen, P.-Y. Profiling genome-wide DNA methylation. Epigenetics Chromatin 9, 1–16 (2016).
Ozsolak, F. & Milos, P. M. RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 12, 87–98 (2011).
Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534 (2009).
Barnhart, H. X., Haber, M. & Song, J. Overall concordance correlation coefficient for evaluating agreement among multiple observers. Biometrics 58, 1020–1027 (2002).
Avants, B. B. et al. The pediatric template of brain perfusion. Sci. Data 2, 1–17 (2015).
Kandel, B. M., Wang, D. J., Detre, J. A., Gee, J. C. & Avants, B. B. Decomposing cerebral blood flow MRI into functional and structural components: a non-local approach based on prediction. Neuroimage 105, 156–170 (2015).
Tustison, N. J. et al. Logical circularity in voxel-based analysis: normalization strategy may induce statistical bias. Hum. Brain Mapp. 35, 745–759 (2014).
Franke, K. & Gaser, C. Ten years of BrainAGE as a neuroimaging biomarker of brain aging: what insights have we gained? Front. Neurol. 10, 789 (2019).
Jernigan, T. L. et al. The pediatric imaging, neurocognition, and genetics (PING) data repository. Neuroimage 124, 1149–1154 (2016).
Bro, R., Kjeldahl, K., Smilde, A. K. & Kiers, H. A. Cross-validation of component models: a critical look at current methods. Anal. Bioanal. Chem. 390, 1241–1251 (2008).
Bickel, S. & Scheffer, T. Multi-view clustering. In Proc. IEEE International Conference on Data Mining. 19–26 (ICDM, 2004).
Wang, Y., Wu, L., Lin, X. & Gao, J. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 29, 4833–4843 (2018).
De Vito, R., Bellio, R., Trippa, L. & Parmigiani, G. Multi-study factor analysis. Biometrics 75, 337–346 (2019).
Eddelbuettel, D. & Balamuta, J. J. Extending R with C++: a brief introduction to Rcpp. Am. Stat. 72, 28–36 (2018).
Avants, B. B., Johnson, H. J. & Tustison, N. J. Neuroinformatics and the The Insight Toolkit. Front. Neuroinform. 9, 5 (2015).
Avants, B. B. et al. A reproducible evaluation of ANTs similarity metric performance in brain image registration. Neuroimage 54, 2033–2044 (2011).
Muschelli, J. et al. Neuroconductor: an R platform for medical imaging analysis. Biostatistics 20, 218–239 (2019).
Zou, H., Hastie, T. & Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 15, 265–286 (2006).
Shen, H. & Huang, J. Z. Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 99, 1015–1034 (2008).
Jolliffe, I. T., Trendafilov, N. T. & Uddin, M. A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12, 531–547 (2003).
Lin, C. J. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19, 2756–2779 (2007).
Jain, P., Netrapalli, P. & Sanghavi, S. Low-rank matrix completion using alternating minimization. In Proc. 45th Annual ACM Symposium on Theory of Computing. 665–674 (ACM, 2013).
Blumensath, T. & Davies, M. E. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27, 265–274 (2009).
Pustina, D., Avants, B., Faseyitan, O. K., Medaglia, J. D. & Coslett, H. B. Improved accuracy of lesion to symptom mapping with multivariate sparse canonical correlations. Neuropsychologia 115, 154–166 (2018).
Hanafi, M. PLS path modelling: computation of latent variables with the estimation mode B. Comput. Stat. 22, 275–292 (2007).
Tenenhaus, A., Philippe, C. & Frouin, V. Kernel generalized canonical correlation analysis. Comput. Stat. Data Anal. 90, 114–131 (2015).
Malkov, Y. A. & Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 824–836 (2018).
Hill, W. G. & Robertson, A. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231 (1968).
Bahmani, S. & Raj, B. A unifying analysis of projected gradient descent for ℓp-constrained least squares. Appl. Comput. Harmon. Anal. 34, 366–378 (2013).
Martí, R., Resende, M. G. & Ribeiro, C. C. Multi-start methods for combinatorial optimization. Eur. J. Oper. Res. 226, 1–8 (2013).
Jernigan, T. L. et al. The Pediatric Imaging, Neurocognition, and Genetics (PING) Data Repository. NeuroImage 124, 1149–1154 (2016).
Avants, B. B., Tustison, N. J. & Stone, J. R. SiMLR in ANTsR: interpretable, similarity-driven multi-view embeddings from high-dimensional biomedical data. Code Ocean https://doi.org/10.24433/CO.3087836.v2 (2021).
Avants, B. B., Tustison, N. J. & Wang, D. J. J. The pediatric template of brain perfusion (PTBP). figshare https://doi.org/10.6084/m9.figshare.923555.v20 (2013).
Acknowledgements
This work is supported by a combined grant from Cohen Veterans Bioscience (CVB-461) and the Office of Naval Research (N00014-18-1-2440) as well as the National Institutes of Health (K01-ES025432-01).
Supplementary data used in the preparation of this article were obtained from the PING study database (https://chd.ucsd.edu/research/ping-study.html). The investigators within PING contributed to the design and implementation of the PING database and/or provided data, but did not participate in the analysis or writing of this report. A complete listing of investigators of the PING study can be found at ref. 79.
Supplementary data collection and sharing for this project was funded by ADNI (National Institutes of Health Grant U01 AG024904) and the Department of Defense ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica; Biogen; Bristol Myers Squibb; CereSpir; Cogstate; Eisai; Elan Pharmaceuticals; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche and its affiliated company Genentech; Fujirebio; GE Healthcare; IXICO; Janssen Alzheimer Immunotherapy Research & Development; Johnson & Johnson Pharmaceutical Research & Development; Lumosity; Lundbeck; Merck & Co.; Meso Scale Diagnostics; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (https://fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory of Neuro Imaging at the University of Southern California.
Author information
Contributions
B.B.A., N.J.T. and J.R.S. made substantial contributions to the conception and design of the work, and the analysis and interpretation of data. B.B.A. and N.J.T. created the software. All authors drafted and revised the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Computational Science thanks Steve Marron, Cathy Philippe and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Fernando Chirigati was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–5, Tables 1–4 and discussion.
About this article
Cite this article
Avants, B.B., Tustison, N.J. & Stone, J.R. Similarity-driven multi-view embeddings from high-dimensional biomedical data. Nat Comput Sci 1, 143–152 (2021). https://doi.org/10.1038/s43588-021-00029-8
DOI: https://doi.org/10.1038/s43588-021-00029-8