Unified AI framework to uncover deep interrelationships between gene expression and Alzheimer’s disease neuropathologies

Deep neural networks (DNNs) capture complex relationships among variables; however, because they require copious samples, their potential has yet to be fully tapped for understanding relationships between gene expression and human phenotypes. Here we introduce an analysis framework, MD-AD (Multi-task Deep learning for Alzheimer’s Disease neuropathology), which leverages an unexpected synergy between DNNs and multi-cohort settings. In these settings, true joint analysis can be stymied when using conventional statistical methods, which require “harmonized” phenotypes and tend to capture cohort-level variations, obscuring subtler true disease signals. Instead, MD-AD incorporates related phenotypes sparsely measured across cohorts and learns interactions between genes and phenotypes not discovered using linear models, identifying signals that are subtler than cohort-level variations and that can be uniquely recapitulated in animal models and across tissues. We show that MD-AD exploits sex-specific relationships between microglial immune response and neuropathology, providing a nuanced context for the association between inflammatory genes and Alzheimer’s Disease.
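To make the idea of learning from phenotypes that are "sparsely measured across cohorts" concrete, the following is a minimal sketch of one common way to train a multi-task network when each cohort records only some phenotypes: a shared hidden layer, one output per phenotype, and a loss that skips missing labels. It is an illustration under stated assumptions, not the MD-AD implementation. The function names (`forward`, `masked_mse`), the single tanh layer, the mean-squared-error objective, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np

def forward(x, w_shared, w_heads):
    """Shared hidden layer followed by one linear head per phenotype.
    (Hypothetical architecture chosen for illustration only.)"""
    hidden = np.tanh(x @ w_shared)   # representation shared by all phenotypes
    return hidden @ w_heads          # shape: (n_samples, n_phenotypes)

def masked_mse(pred, target, mask):
    """Mean squared error computed over observed labels only.

    pred, target, mask have shape (n_samples, n_phenotypes).
    mask is 1 where a phenotype was measured and 0 where it is missing;
    missing entries of target may be NaN.
    Returns (sum of per-phenotype losses, per-phenotype losses).
    """
    mask = mask.astype(float)
    # Replace unobserved targets with 0 so NaNs cannot propagate.
    target = np.where(mask > 0, target, 0.0)
    sq_err = mask * (pred - target) ** 2
    # Average per phenotype over the samples that actually have a label.
    per_task = sq_err.sum(axis=0) / np.maximum(mask.sum(axis=0), 1.0)
    return per_task.sum(), per_task

# Toy data (assumed): 6 samples, 4 genes, 3 phenotypes.
rng = np.random.default_rng(0)
n_genes, n_hidden, n_phenotypes = 4, 3, 3
x = rng.normal(size=(6, n_genes))
target = rng.normal(size=(6, n_phenotypes))

# Cohort A (first 3 samples) measures phenotypes 0 and 1;
# cohort B (last 3 samples) measures phenotypes 1 and 2.
mask = np.array([[1, 1, 0]] * 3 + [[0, 1, 1]] * 3)
target[mask == 0] = np.nan

w_shared = rng.normal(size=(n_genes, n_hidden))
w_heads = rng.normal(size=(n_hidden, n_phenotypes))

total_loss, per_task_loss = masked_mse(forward(x, w_shared, w_heads), target, mask)
```

Because unobserved entries contribute nothing to the loss, samples from every cohort can update the shared layer even though no single cohort measures every phenotype, so the phenotypes do not need to be harmonized beforehand.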


Field-specific reporting

Life sciences

Life sciences study design
No new data were generated in this study. All data sets used were either publicly available or available subject to data-use terms and conditions as described below.
Most human brain gene expression and phenotype data sets were obtained via the AD Knowledge Portal Synapse platform (doi: 10.7303/syn2580853). Access to these data sets may only be obtained after registering for a Synapse.org account, agreeing to acknowledge data used in any publications, and submitting a data use certificate (separately as needed for each data set). Our study uses the following data sets ( All other human brain, mouse brain, and human blood data sets were downloaded from the Gene Expression Omnibus (GEO  Table 6 in their publication).
Source data for replicating all figures are provided with this paper.
For human brain gene expression data, n=3,300. We used all samples for which gene expression and phenotype data are available. These samples represent all subjects with available frozen brain samples at the time of data generation. Of the 3,300 total samples, 1,758 were used for the development of the MD-AD model (ACT, ROSMAP, and MSBB RNA-Seq data sets). The remaining 1,542 samples were used for external validation (Mayo Clinic Brain Bank RNA-Seq data and HBTRC and MSBB microarray data sets). Finally, to provide further external validation for our method, we sought an additional animal model dataset, as well as an additional dataset from another tissue. These would indicate whether it is possible for MD-AD to transfer across species or tissues. Thus, we used 138 mouse brain gene expression samples from Matarin et al., 2015, and 711 human blood samples from the AddNeuroMed cohort.
All data meeting pre-determined quality control criteria were included for analysis. For brain gene expression data, only cortical samples were used.
MD-AD model performance was externally validated using 1,542 separate brain gene expression samples, 711 human blood samples, and 138 mouse brain gene expression samples.
All human data were obtained via observational cohort studies; thus, randomization does not apply. Covariates were not controlled as a preprocessing step, and the model was allowed to learn any implicit covariates contained in gene expression data. Instead, we use post-hoc analyses to identify covariate effects in the model. For mouse studies, data were obtained from transgenic lines along with littermate controls to minimize the presence of covariate effects (furthermore, no model training was based on mouse data; it was used only for evaluation).
Samples were collected and measured by investigators blinded to groups/phenotypes for all brain, blood, and mouse datasets. Preprocessing of datasets was blinded to group labels. Models were trained to predict phenotypes and thus group labels were used for training; however, extensive internal and external validation was performed.