Prioritization of schizophrenia risk genes from GWAS results by integrating multi-omics data

Schizophrenia (SCZ) is a polygenic disease with a heritability approaching 80%. Over 100 SCZ-related loci have so far been identified by genome-wide association studies (GWAS). However, the risk genes associated with these loci often remain unknown. We present a new risk gene predictor, rGAT-omics, that integrates multi-omics data under a Bayesian framework by combining the Hotelling and Box–Cox transformations. The Bayesian framework was constructed using gene ontology, tissue-specific protein–protein networks, and multi-omics data including differentially expressed genes in SCZ and controls, distance from genes to the index single-nucleotide polymorphisms (SNPs), and de novo mutations. The application of rGAT-omics to the 108 loci identified by a recent GWAS study of SCZ predicted 103 high-risk genes (HRGs) that explain a high proportion of SCZ heritability (Enrichment = 43.44 and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p = 9.30 \times 10^{ - 9}$$\end{document}p=9.30×10−9). HRGs were shown to be significantly (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p_{\mathrm{adj}} = 5.35 \times 10^{ - 7}$$\end{document}padj=5.35×10−7) enriched in genes associated with neurological activities, and more likely to be expressed in brain tissues and SCZ-associated cell types than background genes. The predicted HRGs included 16 novel genes not present in any existing databases of SCZ-associated genes or previously predicted to be SCZ risk genes by any other method. More importantly, 13 of these 16 genes were not the nearest to the index SNP markers, and them would have been difficult to identify as risk genes by conventional approaches while ten out of the 16 genes are associated with neurological functions that make them prime candidates for pathological involvement in SCZ. Therefore, rGAT-omics has revealed novel insights into the molecular mechanisms underlying SCZ and could provide potential clues to future therapies.

links 4 . RNA sequencing data on the developmental stage specificity of expression of genes in brain 25 tissues were downloaded from BrainSpan (https://www.brainspan.org). The gene expression level 26 was calculated by the average expression of five brain regions (frontal lobe, temporal lobe, parietal 27 lobe, occipital lobe and cerebellum) in two developmental stages (adolescence and adulthood). 28 DNMs were downloaded from four previous studies 5-8 , from which we extracted 698 DNMs with 29 intragenic mutations including nonsense, splicing, frameshift and missense associated with SCZ. 30 SCZ-related gross deletions were downloaded from the Human Gene Mutation Database (HGMD) 9 ; 31 in total, we obtained 149 gross deletions involving 73 genes. 32 All missing data in these data sources were filled in by means of the K-Nearest Neighbors 33 algorithm. 34 35 Gene expression analysis on tissue and cell specificities 36 The tissue specificity levels of genes were calculated by a previously described strategy 10  To investigate the cell specificity of HRGs in brain cell types, we downloaded the cell specificity 47 The PPI network were obtained from the BioGRID database 15 that includes protein interactions from 98 28 experimental systems (Supplementary Materials). The weight matrix was computed as = 99 , where is the total number of experimental systems supporting the interactions between proteins 100 and ; is the total number of proteins interacting with protein . 101 The tissue-specific networks were obtained from the TissueNet database covering four brain tissues 102 (cerebral cortex, cerebellum, hippocampus, and lateral ventricle) 16  The weighted network was used in RWR to calculate the distance between the gene from locus 107 l and the risk genes. First, we set a restarting rate , which means the gene either moving to other 108 genes with a probability 1 − or moving back to itself with a probability . was set to 0.3 here. 109 Let , be the probability of all genes being reached at step starting with gene . The 110 probability for the next step of moving can be formalized as , +1 = (1 − ) , + , where 111 is a vector with the -th element as 1 and other elements as 0, and ,0 is set to be the interaction 112 weights of gene with all other genes in the weight matrix . This process was iterated until 113 | , +1 − , | < , where is a predefined threshold set as 1 − 6 , and gives , the 114 moving probability of gene l to other genes.
were collected into the movement probabilities of 115 is the number of genes involved in the network. 116 Above process was performed through three kinds of networks, GO network, BioGRID network, 117 and TissueNet network, respectively. If a pair of genes were found to be interacting in more than 118 one networks, the scores in calculated by RWR were summed. The conditional probability of 119 gene contributed by , ( | − , ) can be calculated as the sum of the sub-vectors extracted 120 from , which are rows for and columns for − in . 121

Relation of high risk genes predicted by GO and PPI networks with multi-omics data 123
Integrating the GO network, the BioGRID network and the network from hippocampus predicted 124 106 HRGs and 828 LBGs. We found that these HRGs were associated with nine gross deletions 125 whereas the LBGs have not been reported to harbour any gross deletions (one-sided Fisher's exact 126 test: = 1.43 × 10 −3 ). Moreover, the HRGs contained more enhancers in DRE-promoters from 127 CapHiC (one-sided Wilcoxon rank-sum test: = 8.28 × 10 −4 , Fig.S1 and Fig.S2a) than the 128 LBGs. When we integrated the gross deletion information and the CapHiC data into rGAT-omics, 129 we observed that the HRGs are more enriched in DNMs, have more DRE-promoters, and are more 130 enriched in differentially expressed genes (DE) than the LRGs (Fig.S2b). We also noted that the 131 HRGs were more likely to exhibit developmental stage specificity in all developmental stages (fetal, 132 infancy, childhood, adolescence and adulthood) than the LBGs according to a stage specificity 133 investigation on BrainSpan data by a one-sided Wilcoxon rank-sum test as shown in Fig.S2c. 134 Next, we investigated the distance between the HRGs identified by the integrated networks to 135 the index SNPs (DTS). Among the 106 HRGs, 23 (21.7%) were the genes in closest proximity to 136 the corresponding index SNPs, which is significantly higher than would be expected by chance 137 alone ( < 10 −6 , permutation test). This indicated that although the nearest genes to the index 138 SNPs were not necessarily risk genes, distance is still a contributory factor in the prediction. 139 140

Involvement of HRGs in biological functions related to SCZ 141
The biological processes (BP) function enrichment test of the HRGs was performed by 142 g:Profiler 17 on 103 HRGs predicted by rGAT-omics. In all, 45 functions were found to be 143 significantly ( < 0.05) enriched by the HRGs, which included 12 functions involving neuron 144 or brain activity ( Fig.S4; Table S3). These functions included the voltage-gated calcium channel 145 and signaling function, which was found to be significantly enriched in HRGs and to involve 146 GPM6A and GRIN2A specifically. The relationship between this function and SCZ has been 147 previously highlighted by a number of studies 18-20 . Here, GPM6A and GRIN2A genes were 148 respectively found to capture 213 and 282 links in CapHiC, and they captured many more links 149 than the average value of 74 for all candidate genes. We also identified HRGs enriched in the 150 function of neurogenesis, involving a total of 26 genes shown in Table S3. Among them, CLU has 151 already been implicated in SCZ 2 . Another biological function, nervous system development, was 152 recognized as being enriched among the HRGs. This function was exhibited by 38 HRGs that 153 included genes already known to be closely related to neurological disorders (Table S3). For 154 example, GPM6A is involved in stress-induced hippocampal alterations in psychiatric disorders 19 155 whereas abnormal expression of ZNF804A has been implicated in increased susceptibility to SCZ 21 . 156 Moreover, function brain development was associated with the HRGs, including specifically the 157 ATP6V0D1 gene. ATP6V0D1, which encodes a component of vacuolar ATPase, has been found to 158 be upregulated in the dorsolateral prefrontal cortex from patients with SCZ by comparison with 159 controls 22 . These results suggested that the HRGs identified by rGAT-omics are genuine SCZ risk 160 genes and potentially reflect underlying pathogenic mechanisms. 161 Importantly, 23 genes were predicted as HRGs by rGAT-omics but not included in the SCZ-162 related gene sets (Table S4). Among them, three genes, BTG1, MMP16 and BRINP2, have been 163 reported as being expressed significantly differently between SCZ patients and controls. Briefly, 164 the BTG1 gene has been reported as a blood biomarker of SCZ and has also been found to be 165 differentially expressed in the brains of SCZ patients 23, 24 ; the Mmp16 gene has been reported to 166 show altered expression in brain tissues of a SCZ rat model compared to controls 25 ; the BRINP2 167 gene has been found to be differentially expressed between the SCZ patient group and healthy 168 controls 26 . Additionally, SNX8 has been reported as a β-amyloid (Aβ) toxicity enhancer and 169 associated with Alzheimer disease 27 ; finally, CRBN has been demonstrated to play a role in tau 170 protein accumulation in frontotemporal dementia (FTD) 28 . 171

High involvement of HRGs in targets of nervous system drugs 172
Drug-Protein binding data were downloaded from BindingDB (www.bindingdb.org/bind/ 173 index.jsp) and DrugBank (https://www.drugbank.ca/releases/latest). Data were first filtered by four 174 separate criteria: (1) The protein target is human (Homo sapiens); (2) The binding pocket of target 175 protein is constituted by only one chain; (3) The protein target has at least one available Uniprot 176 ID; (4) The ligand-protein binding affinity is less than 10 μM (either one of Ki, Kd, IC50 and EC50). 177 The ligands in SMILES format were transformed into InChiKey representing each ligand by 178 OpenBabel 29 to avoid more than one SMILES representing one ligand. This yielded 523,611 pairs 179 of ligand-protein interactions in BindingDB, which comprised 368,313 ligands binding to 1,572 180 protein targets. Among these ligands, 1,540 have a DrugBank ID. By using DrugBank ID, we 181 queried DrugBank and obtained 15,750 ligand-protein interactions comprising 4,371 ligands 182 binding to 2,438 protein targets. We next combined the interactions from BindingDB and DrugBank, 183 and obtained 537,720 unique ligand-protein interactions. The ligands with ATC code "N" in the 184 first-class were considered to be drugs with nervous system effect. 185 In total, 2,978 proteins were collated as targets for 368,286 drugs employed in the treatment of 186 diseases of the human nervous system after merging interactions across the two databases. Among 187 them, 24 (23.3%) out of 103 HRGs constitute targets for 4,054 nervous system drugs, which is a 188 significant enrichment compared to LBGs ( = 1.90 × 10 −3 , = 2.25). These results suggest 189 the potential involvement of HRGs in the etiology of SCZ and other neurological diseases. 190 Eleven of the 24 HRGs were not identified by iRIGS. For example, FTL was missed by iRIGS; 191 FTL encodes the light subunit of the ferritin protein and is the target of three drugs (Protoporphyrin,192 Sodium ferric gluconate complex and Ferric pyrophosphate citrate). The FTL gene product is 193 known to be associated with a neurodegenerative disorder related to iron accumulation in the brain 30 . 194 YWHAE, a target of the drug fusicoccin, plays an important role in diverse biochemical activities 195 related to signal transduction and has been shown to be associated with various brain abnormalities, 196 learning disabilities and seizures 31 ; the PPP3CA gene, a target of the drug Myristic acid, has been 197 found to cause a severe neurodevelopmental disease with seizures and childhood 198 neurodevelopmental disorders 32 . The drug targets missed by iRIGS are shown in Table S5. 199 200 Novel high risk genes predicted by rGAT-omics and related with neurological disorders 201 rGAT-omics predicted 16 genes (Table S4) as novel candidates associated with SCZ, including 202 ten genes reported as being associated with neurological disorders. The novel SCZ gene SNX8 has 203 been reported to be a β-amyloid (Aβ) toxicity enhancer associated with Alzheimer disease 33 Table S9 Novel genes interacted with non-novel genes Table S10 Candidate genes with PP ranked in the top 10% predicted by rGAT-omics. 279 Index SNP Gene symbol chr1_8424984_D ALDOA, MAZ, MAPK3, SEZ6L2, SRCAP, RNF40, CDIPT, KCTD13,  TMEM219, PPP4C, YPEL3, CORO1A, TBC1D10B