Interpreting the effects of genetic variants is key to understanding individual susceptibility to disease and designing personalized therapeutic approaches. Modern experimental technologies are enabling the generation of massive compendia of human genome sequence data and associated molecular and phenotypic traits, together with genome-scale expression, epigenomics and other functional genomic data. Integrative computational models can leverage these data to understand variant impact, elucidate the effect of dysregulated genes on biological pathways in specific disease and tissue contexts, and interpret disease risk beyond what is feasible with experiments alone. In this Review, we discuss recent developments in machine learning algorithms for genome interpretation and for integrative molecular-level modelling of cells, tissues and organs relevant to disease. More specifically, we highlight existing methods and key challenges and opportunities in identifying specific disease-causing genetic variants and linking them to molecular pathways and, ultimately, to disease phenotypes.
Subscribe to Journal
Get full journal access for 1 year
only $4.92 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.
Rent or Buy article
Get time limited or full article access on ReadCube.
All prices are NET prices.
1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Bernstein, B. E. et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045–1048 (2010).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005). This paper uses the PhastCons method to give per-base estimates of negative selection within conserved elements using multiple sequence alignments and hidden Markov models.
Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010). This paper uses the pathogenicity GERP++ method to give nucleotide and element-level constraint scores from profiling substitution rates in multiple sequence alignments.
Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).
Ramani, R., Krumholz, K., Huang, Y.-F. & Siepel, A. PhastWeb: a web interface for evolutionary conservation scoring of multiple sequence alignments using phastCons and phyloP. Bioinformatics 35, 2320–2322 (2019).
Cooper, G. M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010). This paper discusses the pathogenicity scoring method PhyloP using multiple sequence alignments and gives the per-base P value for conservation/acceleration scores per clade that reflect divergence from the neutral rate.
Baugh, E. H. et al. Robust classification of protein variation using structural modelling and large-scale data integration. Nucleic Acids Res. 44, 2501–2513 (2016).
Kobren, S. N., Chazelle, B. & Singh, M. PertInInt: an integrative, analytical approach to rapidly uncover cancer driver genes with perturbed interactions and functionalities. Cell Syst. 11, 63–74.e7 (2020). This paper uses PertInInt to assess protein variants for cancer relevance based on predicting the functional impact on physical interactions between proteins and other proteins, nucleic acids, ions, drugs and other small molecules.
Kobren, S. N. & Singh, M. Systematic domain-based aggregation of protein structures highlights DNA-, RNA- and other ligand-binding positions. Nucleic Acids Res. 47, 582–593 (2019).
Ancien, F., Pucci, F., Godfroid, M. & Rooman, M. Prediction and interpretation of deleterious coding variants in terms of protein structural stability. Sci. Rep. 8, 4480 (2018).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Gerasimavicius, L., Liu, X. & Marsh, J. A. Identification of pathogenic missense mutations using protein stability predictors. Sci. Rep. 10, 15387 (2020).
Rodrigues, C. H., Pires, D. E. & Ascher, D. B. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability. Nucleic Acids Res. 46, W350–W355 (2018).
Pires, D. E. V., Ascher, D. B. & Blundell, T. L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res. 42, W314–W319 (2014).
Rodrigues, C. H. M., Myung, Y., Pires, D. E. V. & Ascher, D. B. mCSM-PPI2: predicting the effects of mutations on protein–protein interactions. Nucleic Acids Res. 47, W338–W344 (2019).
Li, M., Simonetti, F. L., Goncearenco, A. & Panchenko, A. R. MutaBind estimates and interprets the effects of sequence variants on protein–protein interactions. Nucleic Acids Res. 44, W494–W501 (2016).
Dehouck, Y., Kwasigroch, J. M., Rooman, M. & Gilis, D. BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations. Nucleic Acids Res. 41, W333–W339 (2013).
Pires, D. E. V., Blundell, T. L. & Ascher, D. B. mCSM-lig: quantifying the effects of mutations on protein–small molecule affinity in genetic disease and emergence of drug resistance. Sci. Rep. 6, 29575 (2016).
Ghersi, D. & Singh, M. Interaction-based discovery of functionally important genes in cancers. Nucleic Acids Res. 42, e18 (2014).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015). This paper presents DeepSEA, a multitask deep learning model that trains and predicts cell type-specific regulatory factor binding to genomic sequence for >900 features and cell types.
Zhou, J. et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980 (2019). This paper, for the first time, finds a significant contribution of non-coding mutations to complex disease risk by demonstrating higher functional impact of de novo mutations from probands with autism compared with siblings, using mutational impacts inferred from deep learning sequence models of transcriptional and post-transcriptional effects.
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015). This paper presents, among the first approaches to predict the tissue-specific impact of changes in the non-coding genome without information on evolution or genome annotations, gkm-svm implementing an SVM classifier that uses only sequence k-mers as input.
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016).
Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).
Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, 20170387 (2018).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Chen, K. M., Cofer, E. M., Zhou, J. & Troyanskaya, O. G. Selene: a PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).
Shrikumar, A., Greenside, P. & Kundaje, A. in ICML’17 Proc. 34th Int. Conf. Machine Learning (eds Precup, D. & Teh, Y. W.) 3145–3153 (PMLR, 2017).
Binder, A. et al. Morphological and molecular breast cancer profiling through explainable machine learning. Nat. Mach. Intell. 3, 355–366 (2021).
Zhang, Z., Park, C. Y., Theesfeld, C. L. & Troyanskaya, O. G. An automated framework for efficiently designing deep convolutional neural networks in genomics. Nat. Mach. Intell. 3, 392–400 (2021).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at bioRxiv https://doi.org/10.1101/588020 (2019).
Wood, A. R. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46, 1173–1186 (2014).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
Yang, J. et al. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525 (2011).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034.e6 (2019).
Lappalainen, T., Scott, A. J., Brandt, M. & Hall, I. M. Genomic analysis in the age of human genome sequencing. Cell 177, 70–84 (2019).
Shendure, J., Findlay, G. M. & Snyder, M. W. Genomic medicine—progress, pitfalls, and promise. Cell 177, 45–57 (2019).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Pasaniuc, B. et al. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics 30, 2906–2914 (2014).
Schurz, H. et al. Evaluating the accuracy of imputation methods in a five-way admixed population. Front. Genet. 10, 34 (2019).
Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
Easton, D. F. et al. Gene-panel sequencing and the prediction of breast-cancer risk. N. Engl. J. Med. 372, 2243–2257 (2015).
Robbins, C. M. et al. Copy number and targeted mutational analysis reveals novel somatic events in metastatic prostate tumors. Genome Res. 21, 47–55 (2011).
Meienberg, J., Bruggmann, R., Oexle, K. & Matyas, G. Clinical sequencing: is WGS the better WES? Hum. Genet. 135, 359–362 (2016).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
French, C. E. et al. Whole genome sequencing reveals that genetic conditions are frequent in intensively ill children. Intensive Care Med. 45, 627–636 (2019).
Hou, Y.-C. C. et al. Precision medicine integrating whole-genome sequencing, comprehensive metabolomics, and advanced imaging. Proc. Natl Acad. Sci. USA 117, 3053–3062 (2020).
Cassini, T. A. et al. Whole genome sequencing reveals novel IGHMBP2 variant leading to unique cryptic splice-site and Charcot–Marie–Tooth phenotype with early onset symptoms. Mol. Genet. Genom. Med. 7, e00676 (2019).
All of Us Research Program Investigators, et al. The ‘All of Us’ research program. N. Engl. J. Med. 381, 668–676 (2019).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2019).
Eilbeck, K., Quinlan, A. & Yandell, M. Settling the score: variant prioritization and Mendelian disease. Nat. Rev. Genet. 18, 599–612 (2017).
Watanabe, K. et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat. Genet. 51, 1339–1348 (2019).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Carter, H., Douville, C., Stenson, P. D., Cooper, D. N. & Karchin, R. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 14, S3 (2013).
Shihab, H. A. et al. Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum. Genomics 8, 11 (2014).
Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Sim, N.-L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001).
Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. SIFT missense predictions for genomes. Nat. Protoc. 11, 1–9 (2016).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019). This paper uses the CADD pathogenicity score as an SVM classifier that integrates multiple functional genomic and evolutionary data to predict coding and non-coding variant impacts.
Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214–220 (2016). This paper presents the Eigen pathogenicity score as an unsupervised meta-score of non-coding variant fitness impact.
Park, J. S. et al. Brain somatic mutations observed in Alzheimer’s disease associated with aging and dysregulation of tau phosphorylation. Nat. Commun. 10, 1–12 (2019).
Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385.e18 (2018).
Buja, A. et al. Damaging de novo mutations diminish motor skills in children on the autism spectrum. Proc. Natl Acad. Sci. USA 115, E1859–E1866 (2018).
Wright, C. F. et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314 (2015).
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107–e107 (2016).
Nair, S., Kim, D. S., Perricone, J. & Kundaje, A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics 35, i108–i116 (2019).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021). This paper describes a novel deep learning framework for functional genomics sequence modelling, which combines neural network models with model interpretation tools to discover high-resolution motif syntax.
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018). This paper uses Basenji as a CNN sequence model that predicts regulatory factor binding and expression based on cap analysis gene expression (CAGE) peak data.
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 10, e1003711 (2014).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015). This paper uses DeepBind as a framework of individual CNNs that train on and predict regulatory factor binding to DNA and RNA.
Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 18, 67 (2017).
Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).
Richter, F. et al. Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nat. Genet. 52, 769–777 (2020). This paper presentes whole-genome sequence analysis of genetic aetiology of congenital heart disease wherein HeartENN, a deep CNN sequence genomic sequence model, is applied to functional impact prediction of de novo non-coding mutations and an excess burden of high-impact mutations is observed in individuals who are affected compared with controls.
Qin, Q. & Feng, J. Imputation for transcription factor binding predictions based on deep learning. PLoS Comput. Biol. 13, e1005403 (2017).
Li, H., Quang, D. & Guan, Y. Anchor: trans-cell type prediction of transcription factor binding sites. Genome Res. 29, 281–292 (2019).
Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018). This paper presents tissue-specific gene expression prediction from sequence using a deep CNN and a linear model, and application to derive constraint violation score pathogenicity, based on cumulative predicted regulatory impacts in genomic intervals.
Dey, K. K. et al. Integrative approaches to improve the informativeness of deep learning models for human complex diseases. Preprint at bioRxiv https://doi.org/10.1101/2020.09.08.288563 (2020).
Law, A. J., Kleinman, J. E., Weinberger, D. R. & Weickert, C. S. Disease-associated intronic variants in the ErbB4 gene are related to altered ErbB4 splice-variant expression in the brain in schizophrenia. Hum. Mol. Genet. 16, 129–141 (2006).
Sangermano, R. et al. ABCA4 midigenes reveal the full splice spectrum of all reported noncanonical splice site variants in Stargardt disease. Genome Res. 28, 100–110 (2018).
de Jong, V. M. et al. Post-transcriptional control of candidate risk genes for type 1 diabetes by rare genetic variants. Genes. Immun. 14, 58–61 (2012).
Cardo, L. F. et al. A Search for SNCA 3′ UTR variants Identified SNP rs356165 as a determinant of disease risk and onset age in Parkinson’s disease. J. Mol. Neurosci. 47, 425–430 (2011).
Zuallaert, J. et al. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics 34, 4180–4188 (2018).
Louadi, Z., Oubounyt, M., Tayara, H. & Chong, K. T. Deep splicing code: classifying alternative splicing events using deep learning. Genes 10, 587 (2019).
Leung, M. K. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-regulated splicing code. Bioinformatics 30, i121–i129 (2014).
Zhang, Y., Liu, X., MacLeod, J. & Liu, J. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 19, 971 (2018).
Zeng, Z. & Bromberg, Y. Predicting functional effects of synonymous variants: a systematic review and perspectives. Front. Genet. 10, 914 (2019).
Barash, Y. et al. Deciphering the splicing code. Nature 465, 53–59 (2010).
Paggi, J. M. & Bejerano, G. A sequence-based, deep learning model accurately predicts RNA splicing branchpoints. RNA 24, 1647–1658 (2018).
Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet. 50, 151–158 (2018).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).
Cummings, B. B. et al. Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 9, eaal5209 (2017).
Ray, T. A. et al. Comprehensive identification of mRNA isoforms reveals the diversity of neural cell-surface molecules with roles in retinal development and disease. Nat. Commun. 11, 3328 (2020).
Lagarde, J. et al. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat. Genet. 49, 1731–1740 (2017).
Gupta, I. et al. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nat. Biotechnol. 36, 1197–1202 (2018).
Hardwick, S. A., Joglekar, A., Flicek, P., Frankish, A. & Tilgner, H. U. Getting the entire message: progress in isoform sequencing. Front. Genet. 10, 709 (2019).
Pan, X. & Shen, H.-B. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinforma. 18, 1–14 (2017).
Pan, X., Rijnbeek, P., Yan, J. & Shen, H.-B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics 19, 1–11 (2018).
Pan, X. & Shen, H.-B. Predicting RNA–protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics 34, 3427–3436 (2018).
Yu, H., Wang, J., Sheng, Q., Liu, Q. & Shyr, Y. beRBP: binding estimation for human RNA-binding proteins. Nucleic Acids Res. 47, e26–e26 (2018).
Zhang, Z. et al. Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods 16, 307–310 (2019).
Wen, M., Cong, P., Zhang, Z., Lu, H. & Li, T. DeepMirTar: a deep-learning approach for predicting human miRNA targets. Bioinformatics 34, 3781–3787 (2018).
Kang, Q., Meng, J., Cui, J., Luan, Y. & Chen, M. PmliPred: a method based on hybrid model and fuzzy decision for plant miRNA–lncRNA interaction prediction. Bioinformatics 36, 2986–2992 (2020).
Park, C. Y. et al. Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk. Nat. Genet. 53, 166–173 (2021).
Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Rehm, H. L. et al. ClinGen—the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).
Harrison, S. M. et al. Clinical laboratories collaborate to resolve differences in variant interpretations submitted to ClinVar. Genet. Med. 19, 1096–1104 (2017).
Stenson, P. D. et al. The human gene mutation database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum. Genet. 139, 1197–1207 (2020).
Kalia, S. S. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet. Med. 19, 249–255 (2017).
Esposito, D. et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019).
Oughtred, R. et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic and chemical interactions. Protein Sci. 30, 187–200 (2021).
Gelman, H. et al. Recommendations for the collection and use of multiplexed functional data for clinical variant interpretation. Genome Med. 11, 85 (2019).
Quang, D., Chen, Y. & Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015). This paper presents a pathogenicity scoring method, which is a deep learning (CNN) version of CADD, for coding and non-coding variant fitness impact.
Davis, C. A. et al. The Encyclopedia of DNA Elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
Gronau, I., Arbiza, L., Mohammed, J. & Siepel, A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Mol. Biol. Evol. 30, 1159–1171 (2013).
Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017). This paper uses a pathogenicity scoring method, LINSIGHT, to predict fitness consequences of non-coding human variation using linear modelling of functional genomic data with a probabilistic model of molecular evolution.
Ernst, C. et al. Performance of in silico prediction tools for the classification of rare BRCA1/2 missense variants in clinical diagnostics. BMC Med. Genomics 11, 35 (2018).
Hart, S. N. et al. Comprehensive annotation of BRCA1 and BRCA2 missense variants by functionally validated sequence-based computational prediction models. Genet. Med. 21, 71–80 (2019).
Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217–222 (2018).
Kim, S. S. et al. Improving the informativeness of Mendelian disease-derived pathogenicity scores for common disease. Nat. Commun. 11, 6258 (2020).
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Consortium, G. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Deutsch, E. W. et al. The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition. Nucleic Acids Res. 45, D1100–D1106 (2017).
Schwenk, J. M. et al. The human plasma proteome draft of 2017: building on the human plasma peptideatlas from mass spectrometry and complementary assays. J. Proteome Res. 16, 4299–4310 (2017).
Huttenhower, C. et al. Exploring the human genome with functional maps. Genome Res. 19, 1093–1106 (2009).
Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B. & Botstein, D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA 100, 8348–8353 (2003).
Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4 (2008).
Snel, B., Lehmann, G., Bork, P. & Huynen, M. A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 28, 3442–3444 (2000).
Greene, C. S. et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 47, 569–576 (2015).
Wong, A. K., Krishnan, A. & Troyanskaya, O. G. GIANT 2.0: genome-scale integrated analysis of gene networks in tissues. Nucleic Acids Res. 46, W65–W70 (2018).
Pierson, E. et al. Sharing and specificity of co-expression networks across 35 human tissues. PLoS Comput. Biol. 11, e1004220 (2015).
Keller, M. P. et al. A gene expression network model of type 2 diabetes links cell cycle regulation in islets with diabetes susceptibility. Genome Res. 18, 706–716 (2008).
Dobrin, R. et al. Multi-tissue coexpression networks reveal unexpected subnetworks associated with disease. Genome Biol. 10, R55 (2009).
Magger, O., Waldman, Y. Y., Ruppin, E. & Sharan, R. Enhancing the prioritization of disease-causing genes through tissue specific protein interaction networks. PLoS Comput. Biol. 8, e1002690 (2012).
Yao, V. et al. An integrative tissue-network approach to identify and test human disease genes. Nat. Biotechnol.36, 1091–1099 (2018).
Roussarie, J.-P. et al. Selective neuronal vulnerability in Alzheimer’s disease: a network-based analysis. Neuron 107, 821–835.e12 (2020).
Goya, J. et al. FNTM: a server for predicting functional networks of tissues in mouse. Nucleic Acids Res. 43, W182–W187 (2015).
Ledo, J. H. et al. Lack of a site-specific phosphorylation of Presenilin 1 disrupts microglial gene networks and progenitors during development. PLoS ONE 15, e0237773 (2020).
Huang, J. K. et al. Systematic evaluation of molecular networks for discovery of disease genes. Cell Syst. 6, 484–495.e5 (2018).
Kamburov, A., Wierling, C., Lehrach, H. & Herwig, R. ConsensusPathDB—a database for integrating human functional interaction networks. Nucleic Acids Res. 37, D623–D628 (2009).
Califano, A., Butte, A. J., Friend, S., Ideker, T. & Schadt, E. Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat. Genet. 44, 841–847 (2012).
Schaefer, R. J. et al. Integrating coexpression networks with GWAS to prioritize causal genes in maize. Plant. Cell 30, 2922–2942 (2018).
Novarino, G. et al. Exome sequencing links corticospinal motor neuron disease to common neurodegenerative disorders. Science 343, 506–511 (2014).
Leiserson, M. D. M. et al. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet. 47, 106–114 (2015).
Ruffalo, M., Koyutürk, M. & Sharan, R. Network-based integration of disparate omic data to identify ‘silent players’ in cancer. PLoS Comput. Biol. 11, e1004595 (2015).
Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011).
Reyna, M. A., Leiserson, M. D. M. & Raphael, B. J. Hierarchical HotNet: identifying hierarchies of altered subnetworks. Bioinformatics 34, i972–i980 (2018).
Creixell, P. et al. Kinome-wide decoding of network-attacking mutations rewiring cancer signaling. Cell 163, 202–217 (2015).
Creixell, P. et al. Pathway and network analysis of cancer genomes. Nat. Methods 12, 615–621 (2015).
Horn, H. et al. NetSig: network-based discovery from cancer genomes. Nat. Methods 15, 61–66 (2018).
Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546–1558 (2013).
Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641 (2010).
Aerts, S. et al. Gene prioritization through genomic data fusion. Nat. Biotechnol. 24, 537–544 (2006).
Köhler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008).
Barabási, A.-L., Gulbahce, N. & Loscalzo, J. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12, 56–68 (2011).
Lage, K. et al. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc. Natl Acad. Sci. USA 105, 20870–20875 (2008).
Winter, E. E., Goodstadt, L. & Ponting, C. P. Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res. 14, 54–61 (2004).
Krishnan, A. et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat. Neurosci. 19, 1454–1462 (2016).
Chikina, M. D. & Troyanskaya, O. G. Accurate quantification of functional analogy among close homologs. PLoS Comput. Biol. 7, e1001074 (2011).
Guan, Y., Ackert-Bicknell, C. L., Kell, B., Troyanskaya, O. G. & Hibbs, M. A. Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Comput. Biol. 6, e1000991 (2010).
Swarup, V. et al. Identification of evolutionarily conserved gene networks mediating neurodegenerative dementia. Nat. Med. 25, 152–164 (2019).
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 4, 7 (2005).
Parikshak, N. N., Gandal, M. J. & Geschwind, D. H. Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders. Nat. Rev. Genet. 16, 441–458 (2015).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Choobdar, S. et al. Assessment of network module identification across complex diseases. Nat. Methods 16, 843–852 (2019).
Lizio, M. et al. Update of the FANTOM web resource: expansion to provide additional transcriptome atlases. Nucleic Acids Res. 47, D752–D758 (2019).
Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
Lindsay, S. J. et al. HDBR expression: a unique resource for global and individual gene expression studies during early human brain development. Front. Neuroanat. 10, 86 (2016).
Zhang, Y. et al. Discerning novel splice junctions derived from RNA-seq alignment: a deep learning approach. BMC Genomics 19, 971 (2018).
Mishra, A. & Macgregor, S. VEGAS2: software for more flexible gene-based testing. Twin Res. Hum. Genet. 18, 86–91 (2015).
This work is supported by National Institutes of Health (NIH)/National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) grants U24DK100845, UGDK114907 and U2CDK114886 and NIH grant UH3TR002158 to O.G.T. The authors thank C. Park for helpful discussions.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- Deep learning
A family of machine learning approaches involving multilayer models composed of interconnected nodes, where the output of each node in the model is a function of its inputs.
- Variant effect
The biochemical or phenotypic impact of a genetic variant relative to a reference allele.
The attribute of an allele as it relates to phenotypic impact; this can be through decreased organismal fitness that is often associated with increased disease risk.
- Support vector machine
(SVM). A standard supervised machine learning approach that identifies the hyperplane (dividing line in high-dimensional space) that optimally separates positive examples from negative examples.
Short lengths of nucleic acid used in computational algorithms, oligomers of ‘k’ length, in bases.
- Convolutional neural network
(CNN). A class of deep learning models that use structure in the input data (for example, adjacencies of pixels in an image or of base pairs in a sequence) to inform connections between nodes of the model. Successive outputs often model features at increasing spatial scales (for example, for sequence models: sequence → motifs → larger sequence contexts).
- Sequence models
A deep learning framework that models the relationship between genetic sequences and properties influencing gene regulation.
- Recurrent neural network
(RNN). A type of neural network in which learning of inputs is influenced by past instances of input examples, and so output varies depending on the sequence of inputs. For example, in speech recognition, applying the context of prior words is useful for determining the meaning of a new word. (Hidden layers pass weights to input information based on previously learned examples.)
An aspect of a complex trait that may be more experimentally measurable than the entire complex trait and may be closer to an underlying biological process. For example, educational attainment is an endophenotype examined in the study of autism genetics because it is a readily measurable trait associated with autism, and expression levels of insulin receptors are endophenotypes contributing to type 2 diabetes.
- Non-coding genome
The portion of the genome that does not encode proteins, which comprises more than 98% of the total human genome length.
In a genetic study, individuals with the disease (typically, the particular affected individuals within families who are the starting points for genetic analyses).
- Tissue-specific network
A network that captures relationships between genes that participate in similar biological processes for a particular tissue or cell type.
About this article
Cite this article
Wong, A.K., Sealfon, R.S.G., Theesfeld, C.L. et al. Decoding disease: from genomes to networks to phenotypes. Nat Rev Genet (2021). https://doi.org/10.1038/s41576-021-00389-x