## Introduction

Although Alzheimer’s dementia is a central nervous system disease and type 2 diabetes MELLITUS (T2DM) is a metabolic disorder, an increasing number of epidemiological and genetic epidemiological studies show clear link between Alzheimer’s dementia and T2DM. Alzheimer’s dementia with great economic, political and social consequences is a progressive, irreversible degenerative disease of the brain and is the most common cause of dementia due to the gradual accumulation of amyloid-beta $$(A\beta )$$ and twisting of tau protein1,2, and other common brain pathologies3. Alzheimer’s dementia is also involved in inflammation and oxidative address and exhibits memory loss and cognitive dysfunction4,5.

Two mechanisms underlying T2DM are insulin resistance and insufficient insulin secretion from pancreatic β-cells4. T2DM patients are unable to process insulin signaling correctly. In response to insulin resistance, pancreatic β-cells increase insulin production. However, when pancreatic β-cells gradually lose function; insulin production cannot be increased to maintain normal glucose levels. The brain is a target organ for insulin6. Insulin signaling plays an important role in the organization and function of the brain and impaired insulin signaling induces an overactivation of GSK-3 kinase, increases tau phosphorylation, alters tau modification and neurofibrillary degeneration7. T2DM also suffer from mild to severe nervous system damage. Persistent blood glucose may impair blood flow to the brain8.

Prior work in ROSMP found an association of T2DM with incident Alzheimer’s dementia and rate of cognitive decline9. However, we did not find an association with Alzheimer’s disease (AD) pathology10. Rather, we found an association with cerebral infarcts. Other evidence from ROSMP continue to point to potential common mechanisms. For example, we found that brain insulin signaling was associated with AD pathology11. We also found interactions between GSKβ polymorphisms associated with β-amyloid deposition12.

The current approaches to identifying several shared pathophysiology processes between Alzheimer’s dementia and T2DM have several limitations. Firstly, the most previous works have focused on identifying biological pathways underlying AD and T2DM. Few attempts to discover the role of dysregulated SNPs, gene expressions and methylations have been carried out. Secondly, the conventional evidences for linking AD and T2DM purely depend on the statistical association13. There has been increasing recognition that association and causation are different concepts14. Association attempts to measure dependence between two variables, while causation is to study the distribution of the variable (effect) after taking action on the another variable (cause). The statistical tool for association analysis is the conditional distribution, while the tool for the causal analysis is the intervention calculus. Many association signals may not be causal signals and some causal signals may not show strong association. If causation loci were searched only from association loci, many causation loci might be missed. The widely used gene expression networks are co-expression networks and phenotype networks are correlation networks. The major tools for integrated omics analysis are based on association analysis. The networks in the most multilevel omics analysis are undirected graphs. It is difficult to use undirected graphs for identifying the causal paths from genetic variants to diseases.

We are facing a great challenge to shift the current analytic platforms of genetic analysis from genetic association analysis to multilevel omics causal analysis for unraveling the mechanic link between AD and T2DM. To meet this challenge, we need (1) to develop and implement causation analysis methods for genetic studies of AD and T2DM; (2) to develop a general framework for construction of multilevel causal omics networks to discover common paths from genetic variations to AD and T2DM via methylations, gene expressions and multiple phenotypes. The real dada set ROSMAP15,16 will be used to valid the multilevel omics causal networks as a useful analytic platform for identifying shared causal paths between AD and T2DM and demonstrates that the proposed methods are capable of identifying the shared pathologic paths between AD and T2DM. A program for construction of multilevel causal networks can be downloaded from https://github.com/wenrurumon/mysrc/tree/master/CNIF_0.3.0.

## Results

### Simulations

To evaluate the performance of the proposed causal network analysis, we conducted a series of simulation studies to compare the detection power and false discovery rate (FDR) for three methods: (1) weighted gene co-expression network (WGCNA), (2) structural equation model (SEM) and structural equation model coupled with integer programming (SEMIP).

We randomly generated 1,000 directed acyclic graphs (networks) with 20 nodes (15 gene expression or phenotype nodes and 5 genotype nodes) and mean 30 directed edges, 1,000 directed acyclic graphs (networks) with 30 nodes (22 expression/phenotype nodes, 8 genotype nodes), and mean 47 directed edges, and 40 nodes (30 gene expression or phenotype nodes and 10 genotype nodes), and mean 68 directed edges, respectively. Simulation results were summarized in Table 1 where we only listed undirected network results because the WGCNA can only estimate the undirected network. We calculated the power and FDR of three methods for 100, 300, 500 and 1,000 samples. We can observe that in all cases, The SEMIP had the largest power and smallest FDR. When the number of nodes in the networks increased, the power to identify the structure of the networks decreased, while FDR increased. When the number of nodes reached 40, the SEMIP can reach 68.5% power and 7.40% FDR using 1,000 samples.

### Shared genetic loci underlying AD and T2DM

The number of AD and T2DM directly connected or indirectly connected genes was summarized in Table 2. The total number of genes connected to both AD and T2DM including directly connected and indirectly connected was 759. The genes that were both directly and indirectly connected to both AD and T2DM were summarized in Table S1. The genes that were indirectly connected to AD and both directly and indirectly connected to T2DM were listed in Table S2. Similarly, the genes that were both directly and indirectly connected to AD and indirectly connected to T2DM were summarized in Table S3.

We also tested causation of 299 pathways in the KEGG pathway database to AD and T2DM (Described in detail in the Methods section). The results were summarized as follows. The number of pathways that were directly connected to both AD and T2DM was 16; the number of pathways that were directly connected to AD and indirectly connected to T2DM was 17; the number of pathways that were directly connected to T2DM and indirectly connected to AD was 18, the number of pathways that were indirectly connected to both AD and T2DM was 114; the number of pathways that were directly connected to AD and not connected to T2DM was 6; the number of pathways that were not connected to AD and directly connected to T2DM was 2.

Then, we investigated shared gene expressions via multilevel causal networks. We summarized the results as follows. The number of expression genes that were directly connected to both AD and T2DM was two genes: GRMD1B, RP1-111D6.3, the number of expression genes that were directly connected to AD, but not directly connected to T2DM was 19 (P-value < 10−4, Table S4) and the number of expression genes that were directly connected to T2DM, but not directly connected to AD was 7 (P-value < 10−4, Table S5). The number of expression genes that were indirectly connected to both AD and T2DM was 725.

Similarly, we can study shared methylation via multilevel causal networks. The number of methylated sites/ genes that were directly connected to AD, but not directly connected to T2DM was 17 (Table S6) and the number of methylated sites/genes that were directly connected to T2DM, but not directly connected to AD was 27 (Table S7). The number of methylated sites/genes that were indirectly connected to both AD and T2DM was 117 (Table S8).

The number of phenotypes that were directly connected to both AD and T2DM was six (Age, CHL, HDL ratio, LDL, Semantic memory and working memory).

### Shared CREBBP, MAPK and PI3K-AKT pathways between AD and T2DM

To assess whether CREBBP is a common genetic factor of AD and T2DM, and how CREBBP mediates the development of AD and T2DM, we searched the all possible paths from gene CREBBP to AD and T2DM in the inferred multilevel causal network. The results were shown in Fig. 1. Figure 1A plotted the path from CREBBP to AD and T2DM via MAPK and PI3K-AKT signaling pathways. The genes in the MAPK and PI3K-AKT signaling pathways, CREBBP, episodic memory, MMSE, AD and T2DM were then used to further infer causal networks using SEMs and IP. The inferred causal network was shown in Fig. 1B. From Fig. 1B we observed a path from CREBBP to AD and T2DM via gene connections: $$CREBBP\to CBL\to MAP2K4\to MAPK8\to MAPK1\to PIK3CA$$. MAPK and PI3K-AKT pathways play critical roles in memory.

### Shared TTC3, FoxO, MAPK, and PI3K-AKT Pathways between AD and T2DM

Next we presented an example to illustrate shared causal paths that started a gene directly connected to AD and indirectly connected to T2DM.

Again, we used the DFS algorithm to search the causal paths from multilevel causal networks. The causal paths from TTC3 to AD and T2DM were shown in Fig. 2. The paths from MAPK and PI3K-AKT pathway to AD and T2DM were the same as that in Fig. 1. The genes in the FoxO, MAPK and PI3K-AKT signaling pathways, TTC3, and episodic memory, MMSE, weight, AD and T2DM were then used to further infer causal networks using SEMs and IP. The structure of the inferred network was shown in Fig. 2B. There were a large number of causal paths from TTC3 to either AD or T2DM. The shared common causal paths were $$TTC3\to NLK\to CACNA2D1\to CNCNG3\to FOXO1$$$$\to CCNE1\to CYCS\to MAPK1\to PIK3CA$$ and $$TTC3\to NLK\to PLK2\to MAPK8\to MAPK1\to MAPK1\to PIK3CA$$.

### Shared morphine addiction and neuroactive ligand receptor interaction pathways

Searching the causal paths from gene HNF4G to AD and T2DM via the multilevel causal networks using the DFS algorithm, we found that HNF4G was indirectly connected to AD and T2DM. In addition to shared MAPK and PI3K-AKT pathways between AD and T2DM which were discussed in the previous sections, we observed shared two new pathways between AD and T2DM: morphine addiction and neuroactive ligand receptor interaction pathways as shown in Fig. S1A. The structure of the inferred network that consisted of shared morphine addiction and neuroactive ligand receptor interaction pathways between AD and T2DM was shown in Fig. S1B. There were more than 10 shared causal paths. We observed two shared major causal paths: (1) $$HNF4G\to NLK\to PLK2\to MAPK8\to MAPK1\to PIK3CA\to AKT1$$ amd (2) $$HNF4G\to NLK\to GNGT2\to PLCB2\to PLCB1\to ADRB1$$.

### Shared fatty acid biosynthesis and primary bile acid biosynthesis pathways

Our data also provided evidence to show that fatty acid biosynthesis and primary bile acid biosynthesis pathways were shared pathways between AD and T2DM. Searching the multilevel causal networks from APP to AD and T2DM using the DFS algorithm, we identified the shared causal paths from APP to both AD and T2DM, shown in Fig. 3A. There were two shared causal paths between AD and T2DM: $$APP\to \,neuroactive\,ligand\,receptor\,interaction\,$$ and $$APP\to fatty\,acid\,biosynthesis\to primary\,bile\,acid\,biosynthesis$$. Neuroactive ligand receptor interaction pathway was discussed in the previous section.

Next we presented the causal network structure of the shared genes between AD and T2DM in the two shared causal paths in Fig. 3B. We observed two major shared paths from APP to AD and T2DM. One path was APP → ACSLA → ACACA → NUDT9 → CMC1 → PTPLAD1 → CYP781 → CYP46A1 → working memory (or CYP781 → AMACA → working memory). Another causal path was APPF2RL3PIK3R3 → (or F2RL3S1PR3PIK3R3.

To further illustrate the validity of the inferred causal paths, we presented Fig. S2 that showed the average levels of expression of the genes in Fig. 3 for AD, T2DM and normal individuals. From Fig. 3, Figs. S2 and S3, we can observed that the genes along the path $$APP\to F2RL3\to PIK3R3\,(or\,F2RL3\to S1PR3\to PIK3R3)$$ of the individuals with AD were over expressed, and the genes along the path APP → ACSLA → ACACA → NUDT9 → CMC1 → PTPLAD1 → CYP781 → CYP46A1 → working memory (or CYP781 → AMACA → working memory) of the individuals with AD were under expressed. Genetic variation in gene APP either regulated over expressed genes or regulated under expressed genes. Both of them caused AD. For the individuals with T2DM, the majority of gene expressions along the causal paths from APP to T2DM which were regulated by genetic variation in gene APP was under expressed.

### Shared methylated genes POU3F2, KIF4B and TNSL3, and dopaminergic synapse and AMPK pathways

In this section, we illustrate how a shared gene regulates three shared gene methylations, which in turn regulate the shared pathways. Our results showed that genetic variation in gene POU3F2 regulated gene expressions in dopaminergic synapse and AMPK pathways via methylations of POU3F2, KIF4B and TMSL3, which in turn influences CHL/HDL Ration, and finally led to AD and T2DM (Fig. 4A).

Again, we presented the causal network structure of the shared genes between AD and T2DM in the two shared dopaminergic synapse and AMPK pathways in Fig. 4B. There were multiple shared directed paths from POU3F2 to AD and T2DM. A major shared directed path: m: POU3F2 → m: LOC644649 → KDM5C → PDPK2 → XPA → MK3R2 → ELK1 → AD (or → CHL → T2DM).

## Discussion

This papers addresses several issues for uncovering causal paths shared between AD and T2DM. The first issue is to shift the current paradigm of genetic analysis from association analysis to deep causal inference for uncovering the shared mechanisms between AD and T2DM. The current paradigm for discovering mechanisms of diseases is association analysis. There is increasing recognition that a large proportion of association signals are not causal signals and causal signals may not be association signals. A large number of causal signals cannot be derived from set of association signals. Only searching causal signals from association analysis, a large proportion of causal signals will be missing. Therefore, the ANMs were developed as practical causal inference methods to identify the genetic variants that cause disease.

Second issue is to shift the current paradigm of genetic analysis from genetic analysis alone to integrated causal genomic, epigenomic, transcriptional and phenotypic data analysis for unraveling the mechanisms of AD and T2DM. The widespread existing omics networks that are essentially undirected graphs. Using undirected graphs, we cannot to identify direct causal relations among diversified types of variables at multilevel and the causal routes from genetic variants to complex phenotypes via omics. In this paper, we develop novel statistical methods for multilevel causal omics network construction and provide pipelines for uncovering shared causal paths between AD and T2DM via gene expressions, DNA methylations, environments and multiple phenotypes.

The third issue is to develop algorithms that can automatically search the causal routes from genetic variations to the complex phenotypes. The size of multilevel causal omics network is large. The number of nodes of such networks can reach ten thousand. The number of causal paths is huge. Manually searching causal paths from large causal networks is infeasible. To meet the challenge of searching causal paths from large causal networks, we develop computer representation of large causal networks and algorithms for searching the causal paths.

The results of application of the proposed pipelines for identifying causal paths to real data analysis of AD and T2DM provided strong evidence to support the link between AD and T2DM and unraveled causal mechanism to explain this link. We identified the shared causal genes, gene expressions, DNA methylations and pathways between AD and T2DM. Some of them can be supported by literature and some of them are new.

Specifically, we identified the shared CREBBP, MAPK and PI3K-AKT pathways between AD and T2DM. Binding of transcription factors to the cyclic Adenosine Monophosphate (cAMP) response element (CRE) regulates the activity of RNA polymerase. cAMP Response Element binding protein (CREB) is a cellular transcription factor that binds the CRE17. CREB-binding protein (CREBBP) and CREB together mediate the conversion of short-term memory to long-term memory and alternate the activity of the β-amyloid (Aβ) peptide, which in turn regulates hippocampal-dependent synaptic plasticity18,19. Cognitive function such as working memory is involved in insulin signaling dysfunction and blood glucose levels. It was reported that working memory is linked with T2DM20,21,22.

The shared TTC3, FoxO, MAPK, and PI3K-AKT Pathways between AD and T2DM were also identified. The tetratricopeptide repeat domain 3 (TTC3) gene was an AD causing gene (P-value for causation of AD < 0.0001), but not directly connected to T2DM (P-value for causation of T2DM = 0.47). TTC3 is associated with differentiation of neurons23. It is reported that a rare TTC3 variant is related with AD24. The TTC3–RhoA pathway could be a key determinant of the neuronal development, resulting in detrimental effects on the normal differentiation program25. Rho regulates the activation of MAPK pathway26. The Forkhead box O (FoxO) transcription factors that affect nervous system amyloid (Aβ) production, are implicated in the regulation of cell apoptosis and survival, and accelerate the progression of degenerative disease. FoxO pathway is involved in the PI3K/Akt and mitogen-activated protein kinase (MAPK) pathways in neuronal apoptosis in the brain.

FoxOs also can offer protection in the nervous system, reduce toxic intracellular protein accumulations and potentially effect Aβ toxicity19,27,28. Akt-FoxO that suppresses TLR4 signaling in Human Leukocytes is implicated in the development of T2DM29. Increasing evidences indicate that PI3K/AKT pathway are implicated in the development of T2DM30,31.

Our results further supported that morphine addiction and neuroactive ligand receptor interaction pathways were shared between AD and T2DM. Morphine addiction has neurotoxic effects and damages to the brain regions that function for learning, memory and emotions32. High dose of morphine may increase risk to T2DM33. It is also reported that neuroactive ligand receptor interaction pathway is associated with both AD and T2DM34.

The causal network analysis provided evidence that fatty acid biosynthesis and primary bile acid biosynthesis pathways were shared between AD and T2DM. Brain function such as intelligence, memory, behavior and concentration are all influenced by brain nutrition35. Omega-3 fatty acids affect the fluidity of brain cell membranes, neurotransmitter synthesis and signal transmission and are implicated in AD36,37. Bile acids are involved in cell signaling and immune function. It performs as potent inhibitors of apoptosis and regulates transcriptional and post-transcriptional events that affect mitochondrial function in neurons38. A trend of increased bile acids in AD has been observed39. Fatty acid utilization induces insulin resistance40. Bile acids are signal molecules and play an important role in regulating metabolism and inflammation. The abnormal bile acids are correlated with changes in insulin secretion, which lead to T2DM41,42. The amyloid precursor protein (APP) is a transmembrane protein. The aggregated amyloid-β (Aβ) peptides are generated by sequential proteolytic processing of the APP. Accumulation of Aβ and the APP play an important role in regulating lipid homeostasis including fatty acids, which finally affect the development of AD43.

We identified an extremely large number of shared causal paths from genetic variants to both AD and T2DM via DNA methylation, gene expressions and phenotypes. This deep knowledge that uncovered the large number of causal mechanisms of AD and T2DM has profound implications in prevention and treatments of AD and T2DM. This explained why the drugs that were based on inhibition or activation of limited number of paths often failed simply because these limited number of paths cannot cover all causal paths to the diseases. Finally, the empirical evidence that the AD and T2DM shared a large number of causal genes, gene expressions, methylations and pathways supported hypothesis that AD can be considered as “type 3 diabetes”.

## Methods

All methods were carried out in accordance with relevant guidelines and regulations.

### ROSMAP data

The data came from two longitudinal cohort studies of older persons. ROS started in 1994 and enrolled Catholic nuns, priests, and brothers from more than 40 communities across United States, and MAP started in 1997 and enrolled participants with diverse backgrounds and socioeconomic status from continuous care retirement communities throughout northeastern Illinois, as well as from individual homes across the Chicago metropolitan area19. These two studies are managed by the same team of investigators. Structured, quantitative neuropathological examinations are performed at a single site. Therefore, the data can be combined for analysis. Multi-layered omics datasets are generated from biospecimens donated by ROS and MAP participants, including genotypes, DNA methylation profiles and RNA-seq. The genotype data were generated by Affymetrix or the Illumina Omniquad express gene chips and were imputed using the 1000 Genomes Project data as reference. DNA methylation profiles were measured using the Illumina Infinium HumanMethylation450 beadset. RNA-seq data were generated using the Illumina HiSeq with 101 bp paired-end reads. Multiple phenotypes including clinical diagnosis, cognitive function, measures of lifestyle, behavior, and activity, chronic medical conditions and risk factors were measured. A total of 432 individuals who simultaneously had genotype, RNA-seq, DNA methylation and some phenotypes were included in analysis. We considered 19 phenotypes and environments, two diseases (AD, T2DM), 299 pathways with RNA-Seq in KEGG pathway database, 20,242 methylation genes with 364,661 CpG sites, and 51, 060 genotyped genes with 5,711,541 SNPs (4,283,876 common SNPs, 1,427,665 rare SNPs). All the data were downloaded from https://www.radc.rush.edu/.

The ROSMAP studies were approved by the Institutional Review Board of Rush University Medical Center. Written informed consent was obtained from all subjects, followed by an Anatomic Gift Act for organ donation.

### General procedures for identifying shared genetic loci underlying AD and T2DM

AD and T2DM result from the interplay of DNA sequence variation and nongenetic factors acting through molecular networks51,52,53. Their etiology is complex with many intermediate steps between genetic variation and diseases. Neither traditional GWAS, nor classical multi-omics analysis can identify the causal passes of complex diseases because not all these analyses can identify directed routes from genetic loci to diseases through environments, methylations, gene expressions, and phenotypes. To overcome these limitations, we developed a novel general framework for identifying all possible causal passes from genetic loci to diseases. The framework consists of three steps. The first step is to perform genome-wide causation studies (GWCS) where we test causation of each SNP across the genome to the disease. The additive noise model (ANM) with discrete variants will be used to test for causation54 (Methods). We focused on the rare variants in the paper. The second step is to use integer programming (IP) and various modern causal models55,56,57 (Methods) for inferring multilevel genome-wide omics causal networks that integrate genotype subnetworks, environmental subnetworks, methylation subnetworks, gene regulatory subnetworks, intermediate phenotype subnetworks and multiple disease subnetworks into a single connected multilevel genotype-disease network as shown in Fig. 5. The third step is to augment graph theoretical approaches with approximations for developing efficient search algorithms that discover all possible routes starting from the genetic variant node directed to the disease node, including classical Depth First Search (DFS) and Breadth First Search (BFS) algorithms58,59,60,61.

There are two ways to identify shared dysfunctional genes (SNPs) between AD and T2DM. One way is to use ANM with discrete variables and functional data analysis to conduct genome-wide causation analysis54,62,63,64 for unravelling the direct connections between gene nodes and disease nodes to identify the shared dysfunctional genes between AD and T2DM.

Another way is to search the paths from the gene nodes to AD and T2DM in multilevel causal omics networks.

Association and causation are different concepts. Association between two variables is often characterized by dependence between two variables. Causation is a connection of phenomena where one variable acts or intervenes on another variables and leads to its changes. Therefore, the key component of causation is the generation and determination of values of one variable by another. The mechanism of causation is related to the transference of matter, motion and information. Causation as part of universe connection, is well known that nature consists of autonomous and independent causal generating process modules. These modules will not influence each other63,65. In other words, while output of one module may inform or influence input of another module, the events between modules are independent. In the probabilistic language, mechanism is often represented by conditional distribution. Independent mechanism states that “the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions”63. In GWCS, we only consider two variables. In this case, independence of cause and mechanism (ICM) indicates that the conditional distribution of the effect given its cause is independent of distribution of cause. Consider the genetic analysis of alleles (A) with a disease allele A a normal allele a and with the disease (D) (disease D and normal d). The joint density function $$P(a,d)$$ can be decomposed into

$$\begin{array}{c}P(A,D)=P(A)P(D|A)\\ \,\,\,\,\,=\,P(D)P(A|D).\end{array}$$

In the association analysis, we assess whether A is independent of D or not. The relationship between A and D is symmetric. However, in causal analysis, causations $$A\to D$$ and $$D\to A$$ are different. They are asymmetric. Assessing causation is to consider the effect of intervention. Causation $$A\to D$$ indicates that the effect of A is to give rise to disease. However, disease status D will not generate allele A. Suppose that locus A is disease locus and $$\,A\to D$$. If we change the allele a to allele A, then we assume that biological mechanism $$P(D|A)$$ responsible for giving rise to disease. This would hold true independent of the distribution (frequencies)_of allele A. If the locus A is disease locus, we can find that the distributions (frequencies) of allele A in two different populations are different, but the mechanism $$P(D|A)$$ would apply in two populations. The conditional probability $$P(D|A)$$ can also be viewed as penetrance of the allele. The marginal distribution $$P(A)$$ and conditional distribution $$P(D|A)$$ contain no information about each other. Both continuous and discrete ANMs satisfy the ICM and will be used for GWCS., The proposed method for genome-wide causation analysis and inferring multilevel causal genotype-methylation-expression-phenotype-disease network was applied to the ROSMAP dataset19 with 432 individuals, 19 phenotypes and environments, two diseases (AD, T2DM), 299 pathways with RNA-Seq in KEGG pathway database, 20,242 methylation genes with 364,661 CpG sites, and 51, 060 genotyped genes with 5,711,541 SNPs (4,283,876 common snp, 1,427,665 rare snp) (imputed by 1000 Gnome Data). The inferred genotype-expression-methylation-phenotype-disease network consisted of 2,814 nodes and 22,184 edges where the edges were presented in the network if the path coefficients were significantly from zero with P-values < 0.05.

There were two ways to connect a gene (or SNP) to AD (T2DM). If a gene (or SNP) showed causation to AD (T2DM) by statistical causal test, then the gene (SNP) was directly connected to AD (T2DM) in the causal network. Such gene (SNP) was called AD (T2DM) directly connected gene (SNP). We may observe the connection between a gene (SNP) and AD (T2DM) via multiple edges (paths) in the constructed multilevel causal network. Then, the gene (SNP) that was indirectly connected to AD (T2DM) via paths in the multilevel causal network was called AD (T2DM) indirectly connected gene (SNP).

### Genome-wide causation studies

Unlike GWAS where we test the association of each variant across the genome with the disease, genome-wide causation studies (GWCS) test the causation of each variant across the genome to the disease. Additive noise models (ANMs) with discrete variables will be used for GWCS54,62,63,64. The procedures that use the ANMs for GWCS are summarized as follows14,54,63.

### Procedures for causal genetic analysis using ANM

1. 1.

Fit the following nonlinear integer regression to the data.

$$Y=f(X)+{N}_{Y}.$$

Calculate the residuals $${\hat{N}}_{Y}=Y-\hat{f}(X)$$.

2. 2.

Fit the following nonlinear integer regression to the data.

$$X=g(Y)+{N}_{X}.$$

Calculate the residuals $${\hat{N}}_{X}=X-\hat{g}(Y)$$.

3. 3.

Test for independence.

The contingence table and Fisher’s exact test can be used to test independence. Let the statistic for testing the independence between $${\hat{N}}_{Y}$$ and X as $${\Delta }_{X\to Y}$$ and the statistic for testing the independence between $${\hat{N}}_{X}$$ and Y as $${\Delta }_{Y\to X}$$.

The null hypothesis for testing the causation of the variant is

H0: no causation between variables X and Y.

The statistic for testing the causation between two X and Y is defined as

$${T}_{C}=|{\Delta }_{X\to Y}-{\Delta }_{Y\to X}|.$$

When $${T}_{C}$$ is large, the causation between genetic variant $$X$$ and disease status $$Y$$ exists. When $${T}_{C}\approx 0$$, this indicates that no causal decision can be made. Since the distribution of the test statistic $${T}_{C}$$ is difficult to calculate, P-value for testing the causation of the variant $$X$$ can be calculated by permutations.

To improve the performance of causation analysis of rare variants, we first calculate the functional principle component score (FPCS) of the rare variants within a gene64 to summerize information of all rare variants within the gene. Then, the continuous FPCS are discreterized. Finally, the ANMs with discrete variables can be used to test causation of discreterized FPCS with the disease.

### Structural equations for construction of causal networks

Directed graphical models and structural equations can be used as a tool to model the complex causal structures among variables64,66. A graphical model consists of nodes and edges. The nodes represent variables and edges represent the dependence structures among variables. A directed graphic model is defined as the graph in which all the inter-node connections have a direction visually denoted by an arrowhead. Directed acyclic graphics (DAGs) are defined as directed graphics with no cycles. In other words, we can never start at a node X, travel edges in the directions of the arrows and get back to the node X. A DAG with nodes encodes conditional dependence structure of the variables $${Y}_{1},\ldots {Y}_{n}$$. We define the parents of a node as the nodes pointing directly to it. The concept of parents provides an easy way to read off conditional independence from DAGs.

Traditional regressions describe one-way or unidirectional relationships among variables in which the variables on the left sides of the equations are dependent variables and the variables on the right sides of the equations are explanatory variables or independent variables. The explanatory variables are used to predict the outcomes of the dependent variables. However, in many cases, there are two ways, or simultaneous relationships between the variables. Variables in some equations are response variables, but will be predictors in other equations. The variables in equations may influence each other. It is difficult to distinguish dependent variables and explanatory variables. The structural equation models (SEMs) are a powerful mathematic tool to describe such data generating mechanism and infer causal relationships among the variables.

The SEMs classify variables into two class variables: endogenous and exogenous variables. The jointly dependent variables that are determined in the model are called endogenous variables. The explanatory variables that are determined outside the model or predetermined are called exogenous variables. In the genotype-phenotype networks, the phenotype variables such as BMI, cognitive function, working memory, are endogenous variables, age, sex, race, environments and genotypes are exogenous variables. In the genotype-expression networks, the gene expressions are endogenous variables and genotypes are exogenous variables. In the methylation-expression networks, gene expressions are endogenous variables and methylations are exogenous variables.

We consider M endogenous variables. Assume that n individuals are sampled. We denote the n observations on the M endogenous variables by the matrix $$Y=[{y}_{1},{y}_{2},\ldots ,{y}_{M}]$$, where $${y}_{i}={[{y}_{1i},{y}_{2i},\ldots ,{y}_{ni}]}^{T}$$ is a vector of collecting n observation of the endogenous variable i. Exogenous variables are denoted by $$X=[{x}_{1},{x}_{2},\ldots ,{x}_{M}]$$ where $${x}_{i}={[{x}_{1i},{x}_{2i},\ldots ,{x}_{ni}]}^{T}$$. Similarly, random errors are denoted by E $$=\,[{e}_{1},{e}_{2},\ldots ,{e}_{M}]$$ where we assume $$E[{e}_{1}]=0$$ and $$E[{e}_{i}{{e}_{i}}^{T}]={\sigma }_{i}^{2}{I}_{n}$$ for $$i=1\ldots M$$ The linear structural equations for modeling relationships among variables can be written as:

$$\begin{array}{c}{y}_{1}{\gamma }_{11}+{y}_{2}{\gamma }_{21}+\ldots +{y}_{M}{\gamma }_{M1}+{x}_{1}{\beta }_{11}+{x}_{2}{\beta }_{21}+\ldots +{x}_{K}{\beta }_{K1}+{e}_{1}=0\\ \vdots \,\,\,\,\,\,\,\,\,\,\,\vdots \,\,\,\,\,\,\,\,\vdots \\ {y}_{1}{\gamma }_{1M}+{y}_{2}{\gamma }_{2M}+\ldots +{y}_{M}{\gamma }_{MM}+{x}_{1}{\beta }_{1M}+{x}_{2}{\beta }_{2M}+\ldots +{x}_{K}{\beta }_{KM}+{e}_{M}=0\end{array}$$
(1)

where the γ’s and β’s are the structural parameters of the system that are unknown. Variables in the SEMs can be classified into two basic types of variables: observed variables that can be measured and the residual error variables that cannot be measured and represent all other unmodeled causes of the variables. Most observed variables are random. Some observed variables may be nonrandom or control variables (e.g. genotypes, drug dosages) whose values remain the same in repeated random sampling or might be manipulated by the experimenter. The observed variables will be further classified into exogenous variables, which lie outside the model, and endogenous variables, whose values are determined through joint interaction with other variables within the system. All nonrandom variables can be viewed as exogenous variables. The terms exogenous and endogenous are model specific. It may be that an exogenous variable in one model is endogenous in another.

Traditionally, we often select one endogenous variable to appear on the left-hand side of the equation. Specifically, the i-th equation is

$${y}_{i}={y}_{1}{\gamma }_{1i}+\ldots +{y}_{i-1}{\gamma }_{i-1i}+{y}_{i+1}{\gamma }_{i+1i}+\ldots +{y}_{M}{\gamma }_{Mi}+{x}_{1}{\beta }_{1i}+\ldots +{x}_{K}{\beta }_{Ki}+{e}_{i},$$
(2)

where $${\gamma }_{ji}$$ is a path coefficient that measures the strength of the causal relationship from $${Y}_{j}$$ to $${y}_{i}$$, $${\beta }_{ki}$$ is a path coefficient from the exogenous variable to the endogenous variable which measure the causal effect of the exogenous variable $${x}_{k}\,$$ on the endogenous variable $${y}_{i}$$. The coefficients $${\gamma }_{ji}=0\,$$ and $${\beta }_{ki}=0$$ imply the zero direct influence of $${Y}_{j}$$ and $${x}_{k}$$ on $${Y}_{i}$$, respectively and are usually omitted from the equation. Therefore, Eq. (2) is reduced to

$${y}_{i}={Y}_{-i}{\gamma }_{i}+{X}_{i}{\beta }_{i}+{e}_{i}={W}_{i}{\Delta }_{i}+{e}_{i}$$
(3)

where $${Y}_{-i}$$ is a vector of the endogenous variables after removing variable $${Y}_{i}$$, $${\gamma }_{i}$$ is a vector of the path coefficients associated with $${Y}_{-i}$$, and

$${W}_{i}=[{Y}_{-i},\,{X}_{i}],\,{\Delta }_{i}={[{\gamma }_{i},{\beta }_{i}]}^{T}.$$

Multiplying by the matrix $${X}^{T}$$ on both sides of Eq. (3), we obtained

$${X}^{T}{y}_{i}={X}^{T}{W}_{i}{\Delta }_{i}+{X}^{T}{e}_{i}.$$
(4)

Estimation of the parameters in the structural equations is rather complex. It involves many different estimation methods with varying statistical properties. We used two stage least squares (2SLS) method to estimate the parameters. In general, the causal networks are sparse. Using weighted least square and l1-norm penalization of Eq. (4), we can form the following optimization problem to estimate the structure of causal network:

$$\mathop{{\rm{\min }}}\limits_{{\Delta }_{i}}f({\Delta }_{i})+\lambda ||{\Delta }_{i}|{|}_{1}$$

Where

$$f({\Delta }_{i})={({X}^{T}{y}_{i}-{X}^{T}{W}_{i}{\Delta }_{i})}^{T}{({X}^{T}X)}^{-1}({X}^{T}{y}_{i}-{X}^{T}{W}_{i}{\Delta }_{i}).$$
(5)

The alternating direction method of multipliers (ADMM) and proximal methods can be used to estimate the parameters and structure of causal network64,67,68.

### Functional structural equation models for construction of gene-based causal networks

The SEMs carry out variant by variant analysis. However, the power of the traditional variant-by-variant analytical tools for construction of causal networks with rare variants as exogenous variables will be limited. Large simulations have shown that combining information across multiple variants in a genomic region of analysis will greatly enhance the power to infer causal networks with rare variants as exogenous variables. To utilize multi-locus genetic information, we propose to use a genomic region or a gene as a unit in construction of causal networks and develop sparse structural functional equation models (SFEMs) for causal network analysis66.

We define a genotype function. Let t be a genomic position. Define a genotype function $${x}_{i}(t)$$ of the i-th individual as

$${x}_{i}(t)=\{\begin{array}{c}2{P}_{q}(t)\,\\ {P}_{q}\,({\rm{t}})-{P}_{Q}(t)\,\\ -2{P}_{Q}(t)\end{array}\begin{array}{c}QQ\\ Qq\\ qq\end{array}$$

where Q and q are two alleles of the marker at the genomic position t, $${P}_{Q}({\rm{t}})\,$$ and $${P}_{q}({\rm{t}})$$ are the frequencies of the alleles Q and q, respectively. Suppose that we are interested in k genomic regions or genes $$[{a}_{j},\,{b}_{j}]$$ denoted as $${T}_{j},\,j=1,\ldots ,K$$. We consider the following functional structural equation models (FSEMs):

$$\begin{array}{c}{y}_{1}{\gamma }_{11}+{y}_{2}{\gamma }_{21}+\ldots +{y}_{M}{\gamma }_{M1}+{\int }_{{T}_{1}}{x}_{1}(t){\beta }_{11}(t)dt+\ldots +{\int }_{{T}_{k}}{x}_{k}(t){\beta }_{k1}(t)dt+{e}_{1}=0\\ \vdots \,\,\,\,\,\,\,\,\,\,\,\vdots \,\,\,\,\,\,\,\,\vdots \\ {y}_{1}{\gamma }_{1M}+{y}_{2}{\gamma }_{2M}+\ldots +{y}_{M}{\gamma }_{MM}+{\int }_{{T}_{1}}{x}_{1}(t){\beta }_{1M}(t)dt+\ldots +{\int }_{{T}_{k}}{x}_{k}(t){\beta }_{kM}(t)dt+{e}_{M}=0\end{array}$$
(6)

where $${\beta }_{ij}(t),\,j=1,\ldots ,k,\,i=1,\ldots ,M$$ are genetic effect functions.

Functional principal components (FPCs) are efficient summary statistics. The FPCs simultaneously employs genetic information of the individual variants and correlation information (LD) among all variants. For each genomic region or gene, we use functional principal component analysis to calculate principal component function. Let N be the number of sampled individuals. We expand $${x}_{nj}(t),\,n=1,\ldots ,N,\,j=1,\ldots ,k$$ in each genomic region in terms of orthogonal principal component functions:

$${x}_{nj}(t)=\,\mathop{\sum }\limits_{l=1}^{{L}_{j}}{\eta }_{njl}{\phi }_{jl}(t),\,j=1,\ldots ,k,$$

where $${\phi }_{jl}(t),\,j=1,\ldots ,k,\,l=1,\ldots ,{L}_{j}$$ are the l-th principal component function in the j-th genomic region or gene and $${\eta }_{njl}$$ are the functional principal component scores of the n-th individual. Using the functional principal component expansion of $${x}_{nj}(t)$$, we can transform the FSEMs (6) into the traditional multivariate SEMs (1).

### Integer programming for causal network learning

Given the dataset, learning causal networks is the task of finding network structures that best fits the data57,64. We used “score and search” methods to learn causal networks via maximizing the score metrics that characterize the causal networks. The “score and search” algorithms consist of two parts: (1) formulate objective function (global score for the whole network) using the score function for each node and (2) search algorithm.

We collected all nodes with directed edges in the causal network into a DAG, denoted as $$G=(V,E)$$. The score (objective function) for the DAG G was defined as

$$Score\,(G)=\sum _{j\in V}Scor{e}_{j}(G)m$$

where $$Scor{e}_{j}(G)$$ was a score for the node j in the network. The $$Scor{e}_{j}(G)$$ was calculated as $$f({\Delta }_{j})$$ via solving the optimization problem (5). Therefore, the total score can be decomposed into a sum of score for all nodes in the DAG. In addition, the $$Scor{e}_{j}(G)$$ is entirely determined by the parent set of the node j in G. A DAG can be encoded by the set $$W=\{{W}_{1},\ldots ,{W}_{p}\}$$ of parent variables for all nodes V in the graph G. We use $$C(j,{W}_{j})$$ to denote a score function for the pair of node j and its parent set $${W}_{j}$$. Therefore, the total score for the DAG G was given by

$$C(D)=\sum _{i\in V}C(v,{W}_{v}).$$

The learning task is to find a DAG that optimizes the global score C(D) over all possible DAGs D or parent sets57:

$$\mathop{\min }\limits_{D}\sum _{i\in V,\,{W}_{v}\in D}C(v,{W}_{v}).$$

Integer linear programming (ILP) was used as a search algorithm57. A DAG learning was formulated as the ILP as follows. We define a variable $$x({W}_{v}\to v)$$ to indicate the presence or absence of the parent set $${W}_{v}$$ in the DAG. In other words, $$x({W}_{v}\to v)=1$$ if and only if it is the parent set for the node v. The parent set $${W}_{v}$$ can be an empty set. The objective function for the ILP formulation of a DAG learning can be defined as

$$\mathop{\sum }\limits_{v=1}^{P}\mathop{\sum }\limits_{{j}_{v}=1}^{{J}_{v}}C(v,{W}_{{j}_{v}})x({W}_{{j}_{v}}\to v).$$
(7)

The goal was to find a candidate parent set $${W}_{v}$$ for each node $$v$$ by optimizing the objective function in (7). It is clear that every DAG can be encoded by a zero-one indicator variable. However, any set of zero-one numbers may not encode a DAG. A set of linear constraints must be posted to make the set of indicator variables to represent a DAG. Without constraints all indicator variables for the parent sets will be equal to either zero or one. These solutions will not form a DAG. The constraints need to be imposed to ensure that the solutions encode a DAG. This constraint that is referred to as convexity constraint, can be expressed as

$$\mathop{\sum }\limits_{{i}_{j}=1}^{{I}_{j}}x({W}_{{i}_{j}}\to j)=1,\,j=1,\ldots ,p$$
(8)

The convexity constraints (8) can define a directed graph. However, the generated directed graph may have cycles. To eliminate a cycle, we need to impose the following constraint to ensure that any subset C of the nodes V in a DAG must contain at least one node that has no parent in the subset C

$$\forall \,C\subseteq \sum _{j\in C}\sum _{W:W{\cap }^{}C=\varnothing }x(W\to j)\ge 1,$$
(9)

which is referred to as cluster-based constraints. Our goal is to find a candidate parent set Wj for each node j by optimizing objective function (7) subject to the constraints (8) and (9).

The branch and bound method is a popular algorithm ensured to find an optimal solution to the 0–1 ILP problem57. Let the LP solution represent “solution of the current linear relaxation”. The basic idea of the branch and bound method is to successively divide the ILP problem into smaller problems that are easy to solve and reduce the search space. Briefly, the branch and bound algorithm is summarized as follows. Step 1: Let $$\hat{x}$$ be the LP solution. Step 2: if there are, valid constraints not satisfied by $$\hat{x}$$ add them and go to Step 1; otherwise if the solution $$\hat{x}$$ is an integer then stop, the current problem is solved; otherwise branch on a variable with a non-integer part in $$\hat{x}$$ to generate two new sub-IP problems. We then again use branch and bound algorithms to solve two sub-ILP problems57.

### Multilevel causal networks

Multilevel causal omics networks integrated genotype subnetworks, methylation subnetworks, gene expression subnetworks, the intermediate phenotype subnetworks and multiple disease subnetworks into a single connected multilevel genotype-disease networks to reveal the deep causal chain of mechanisms underlying the diseases64. ILP was extended from a single causal network estimation to joint multiple causal network estimations to integrate genomic, epigenomic and phenotype data.

For the convenience of discussion, consider M gene expression variables $${Y}_{1},\ldots ,{Y}_{M},$$ Q methylation variables $${Z}_{1},\ldots ,{Z}_{Q}$$, and K genotype variables $${X}_{1},\ldots ,{X}_{K}$$. Let $$p{a}_{D}(D)\,$$ be the parent set of the node d including gene expression, methylation and genotype variables. Consider three types of SEMs. First, we consider a general SEM model for the gene expression:

$${Y}_{d}=\sum _{i\in p{a}_{D}(D)}{f}_{di}({Y}_{i})+\sum _{q\in p{a}_{D}(D)}{f}_{dq}({Z}_{q})+\sum _{j\in p{a}_{D}(D)}{f}_{dj}({X}_{j})+{\varepsilon }_{d},\,d=1,\ldots M$$
(10)

And

$${Z}_{q}=\sum _{l\in p{a}_{Q}(q)}{f}_{ql}({Z}_{l})+\sum _{m\in p{a}_{Q}(q)}{f}_{qm}({X}_{m})+{\varepsilon }_{q},\,q=1,\ldots Q$$
(11)

where $${f}_{d}$$ and $${f}_{q}$$ are linear functions from $${R}^{|p{a}_{D}|}\to R$$ and $${R}^{|p{a}_{Q}|}\to R$$, respectively, and the errors $${\varepsilon }_{d}$$ and $${\varepsilon }_{q}$$ are independent, following distributions $${P}_{{\varepsilon }_{d}}$$ and $${P}_{{\varepsilon }_{Q}}$$, respectively. Equation (10) define a causal network that connects gene expressions, methylations and genotypes. Equation (11) define a causal network that connects methylations and genotypes.

### Integer programming as a general framework for joint estimation of multiple causal networks

We collected multiple types of data: genotype, gene expression, methylation, and phenotype and disease data. We wanted to estimate multiple causal networks with different types of data64.

The scores of the nodes $${Y}_{d}$$ and $${Z}_{q}$$ were, respectively, given by

$$C({Y}_{d},{W}_{di})={Y}_{d}^{T}(I-{D}_{Y}^{i}{({({D}_{Y}^{i})}^{T}{D}_{Y}^{i})}^{-1}{({D}_{Y}^{i})}^{T}){Y}_{d}$$
(12)

and

$$C({Z}_{q},{W}_{ql})={Z}_{q}^{T}(I-{D}_{Z}^{l}{({({D}_{Z}^{l})}^{T}{D}_{Z}^{l})}^{-1}{({D}_{Z}^{l})}^{T}){Z}_{q}$$
(13)

where matrices $${D}_{Y}^{i}$$ and $${D}_{Z}^{l}$$ corresponded to the parent sets $${W}_{di}$$ and $${W}_{ql}$$.

Let $${V}_{E}$$ be the set of nodes in the gene expression network and $${V}_{M}$$ be the set of nodes in the methylation network. Let $${C}_{E}$$ be a subset of nodes in $${V}_{E}$$ and $${C}_{M}$$ be a subset of nodes in $${V}_{M}$$. A joint expression and methylation causal network can be formulated as the following ILP:

$$\begin{array}{c}\min \,\mathop{\sum }\limits_{d=1}^{M}\sum _{i\in p{a}_{D}(d)}C(d,{W}_{di})x({W}_{di}\to d)+\mathop{\sum }\limits_{q=1}^{Q}\sum _{l\in p{a}_{Q}(q)}C(q,{W}_{ql})x({W}_{ql}\to q)\\ {\rm{s}}{\rm{.t}}.\,\,\sum _{i\in p{a}_{D}(d)}x({W}_{di}\to d)=1,\,d=1,\ldots ,M,\,\\ \sum _{i\in p{a}_{Q}(q)}x({W}_{ql}\to q)=1,\,q=1,\ldots ,Q,\\ \forall \,{C}_{E}\subseteq {V}_{E}:\sum _{d\in {C}_{E}}\sum _{{W}_{d}:{W}_{d}{\cap }^{}{C}_{E}=\varnothing }x({W}_{d}\to d)\ge 1,\\ \forall \,{C}_{M}\subseteq {V}_{M}:\sum _{q\in {C}_{M}}\sum _{{W}_{q}:{W}_{q}{\cap }^{}{C}_{M}=\varnothing }x({W}_{q}\to q)\ge 1.\end{array}$$
(14)

Using branch and bound and other methods for solving the ILP, we can solve the ILP problem (14) to obtain the best joint causal genotype-methylation-expression and genotype-methylation network fitting the data.

### Summary statistics for representation of groups of gene expressions

Generalized low rank models were used to segment (cluster) the data. Principal component analysis (PCA) was used to reduce data dimensions. The PCs were used to summarize the gene expression data in pathways and clusters69.

### Simulations of causal networks

We simulated causal networks with genes (genotype) and gene expressions as the nodes of the networks. We randomly selected 8 genes (30 node model) and 10 genes (50 node model) from the ROSMAP dataset. The genotype information of multiple SNPs within a gene was summarized by FPCA scores which were taken as the values of the gene node. We used R package PCALG70,71 to randomly generate DAG with 30 nodes (edges ranging from 70 to 90), and with 50 noes (edges ranging from 80 to 110). The values of the gene expression nodes were generated by the following model66,70,71,72

$${y}_{i}=\sum _{j\in pa({y}_{i})}{\gamma }_{ji}{y}_{j}+\sum _{k\in pa({y}_{i})}{\beta }_{ki}{x}_{k}+{e}_{i},\,i=1,\ldots ,M,$$

where $$pa({y}_{i})$$ is the set of parents of the node $${y}_{i}$$, the coefficients $${\gamma }_{ji}$$ and $${\beta }_{ki}$$ followed a uniform distribution $$u(1,2)$$, $${e}_{i}$$ followed a normal distribution $$N(0,1)$$.

A total of 100, 300, 500 and 1,000 DAGs were generated. The number of replication was 1,000. Let $${N}_{t}$$ be the total number of edges among simulated DAGs, $${N}_{0}$$ the total number of edges that were not presented in the simulated DAGs, $${N}_{True}$$ the total number of edges detected by the algorithm and $${N}_{False}$$ the false edges directed among $${N}_{0}$$. Then the false discovery rate (FDR) was defined as $$\frac{{N}_{False}}{{N}_{0}}$$ and power of detection defined as $$\frac{{N}_{True}}{{N}_{t}}$$.

### Ethical approval and informed consent

The ROSMAP studies were approved by the Institutional Review Board of Rush University Medical Center. Written informed consent was obtained from all subjects, followed by an Anatomic Gift Act for organ donation.