A new method for identifying causal genes of schizophrenia and anti-tuberculosis drug-induced hepatotoxicity

Schizophrenia (SCZ) may cause tuberculosis, the treatments for which can induce anti-tuberculosis drug-induced hepatotoxicity (ATDH) and SCZ-like disorders. To date, the causal genes of both SCZ and ATDH are unknown. To identify them, we proposed a new network-based method that integrates the network random walk with restart algorithm, gene set enrichment analysis, and the hypergeometric test; using this method, we identified 500 common causal genes. For gene validation, we created a regularly updated online database, ATDH-SCZgenes, and conducted a systematic meta-analysis of the association of each gene with either disease. To date, only GSTM1 and GSTT1 have been well studied with respect to both diseases, and a total of 23 high-quality association studies were collected for the current meta-analysis validation. Finally, the GSTM1 present genotype was confirmed to be significantly associated with both ATDH [Odds Ratio (OR): 0.71, 95% confidence interval (CI): 0.56–0.90, P = 0.005] and SCZ (OR: 0.78, 95% CI: 0.66–0.92, P = 0.004) according to the random-effect model. Furthermore, these significant results were supported by “moderate” evidence according to the Venice criteria. Our findings indicate that GSTM1 may be a causal gene of both ATDH and SCZ, although further validation pertaining to other genes, such as CYP2E1 or DRD2, is necessary.


Results
Identification of causal genes for both ATDH and SCZ. Five known ATDH/ATDILI-related genes and 1,305 known SCZ-related genes were collected from GenBank. These genes were then mapped onto the human STRING network; five ATDH genes and 1,079 SCZ genes were mapped successfully. Using these mapped genes, we predicted 3,045 and 1,458 possible ATDH- and SCZ-related genes, respectively, with potential effects on known disease-related genes. Among these, we identified 878 genes that overlapped between the expanded ATDH and SCZ gene lists (Supplementary Table S3). Furthermore, 500 genes with significant false discovery rate (FDR)-corrected hypergeometric test P values (< 10⁻⁸) were identified as causal factors for both ATDH and SCZ (Supplementary Table S4).
To validate these 500 genes as common causal factors for both ATDH and SCZ, we created a regularly updated online database, ATDH-SCZgenes (www.bio-x.cn/ATDH-SCZgenes.html), and conducted a field synopsis/systemic meta-analysis of the associations of polymorphisms in each candidate gene with ATDH or SCZ. During validation, GWAS data were collected first, followed by data from candidate gene association studies. To date, no GWAS for ATDH has been reported. Among these genes, only GSTM1, CYP2E1 and glutathione S-transferase theta-1 (GSTT1) (P = 5.61E-22, 4.54E-18 and 3.87E-09, respectively; Table 1) have been reported to be associated with both ATDH and SCZ; every other gene has been reported to be associated with at most one of the two diseases. Genes with significant effects only on SCZ, which were obtained from the SzGene database 17 or a systemic meta-analysis of genome-wide association study (GWAS) data provided by Ricopili 18 , are listed in Supplementary Table S5; the association of these genes with ATDH remains to be tested.

Characteristics of the included studies.
A flow diagram summarizing the study selection process is shown in Supplementary Fig. 1. A total of 33 and 24 potentially relevant studies regarding the association between CYP2E1/GSTM1/GSTT1 polymorphisms and the respective risks of ATDH and SCZ were identified after an initial screening based on the titles and/or abstracts of the candidate articles. After the second screening, the following were identified: 699 cases and 2,546 controls from fifteen studies of CYP2E1 and ATDH; 679 cases and 2,289 controls from fourteen studies of GSTM1 and ATDH; 592 cases and 2,569 controls from fourteen studies of GSTT1 and ATDH; one case-control study of CYP2E1 and SCZ; 1,469 cases and 1,605 controls from seven studies of GSTM1 and SCZ; and 936 cases and 971 controls from five studies of GSTT1 and SCZ. The detailed characteristics of each study are listed in Supplementary Table S6. All studies examined the same mutation, namely the complete loss of GSTM1 or GSTT1. Because the association studies involving CYP2E1 did not meet the fifth inclusion criterion (i.e., at least three studies regarding the association of each gene with either disease), studies involving this gene were omitted (detailed information about CYP2E1 can be obtained from the online database). The genotype distributions of the cases and controls from all studies involving GSTM1 or GSTT1 in the context of ATDH and SCZ are presented in Tables 2 and 3, respectively. The null genotype refers to homozygous gene loss, which indicates a loss of gene function, and the present genotype includes both heterozygous gene loss and homozygous complete gene presence.

Association of GST polymorphisms with ATDH. Evaluations of the associations between GSTM1/GSTT1 polymorphisms and the risk of ATDH are summarized in Fig. 1. Significant heterogeneity in the effects of these polymorphisms was observed for GSTM1, but not for GSTT1 [P = 0.088 and I² = 36%, 95% confidence interval (CI): 0-0.662 for GSTM1; P = 0.12 and I² = 32%, 95% CI: 0-0.61 for GSTT1]. Because fewer than 20 studies were included in the meta-analysis, the random-effect model was used for both GSTM1 and GSTT1.

Sensitivity analyses and publication bias.
A sensitivity analysis was conducted by omitting one study at a time to assess the effect of each individual study on the overall meta-analysis estimate. When one study was excluded, the P values for overall effects ranged from 3.97E-5 to 0.004 and from 0.17 to 0.68 in the fixed-effect model analyses of GSTM1 and GSTT1 with ATDH, respectively; in the fixed-effect model analyses of GSTM1 and GSTT1 with SCZ, the respective P values for overall effects ranged from 0.001 to 0.015 and from 0.005 to 0.32. These values indicate the stability of these analytical results. Furthermore, Harbord's test indicated no significant publication bias in the overall meta-analysis except for studies of the association between GSTT1 and ATDH and the association between GSTM1 and SCZ (P = 0.56 for GSTM1 vs. ATDH; P = 0.08 for GSTT1 vs. ATDH; P = 0.0637 for GSTM1 vs. SCZ; P = 0.91 for GSTT1 vs. SCZ).
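The leave-one-out procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Review Manager/Stata workflow: it substitutes an inverse-variance fixed-effect model for the Mantel-Haenszel method, assumes no zero cells, and uses hypothetical 2 × 2 counts. The function names `fixed_effect_p` and `leave_one_out` are likewise illustrative.

```python
from math import erfc, log, sqrt

def fixed_effect_p(tables):
    """Two-sided P value for the pooled log OR (inverse-variance fixed effect).

    Each table is (a, b, c, d): cases with/without the genotype,
    then controls with/without the genotype. Assumes no zero cells.
    """
    log_ors = [log((a * d) / (b * c)) for a, b, c, d in tables]
    weights = [1.0 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in tables]
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
    z = pooled * sqrt(sum(weights))  # pooled / SE, with SE = 1 / sqrt(sum of weights)
    return erfc(abs(z) / sqrt(2))    # two-sided normal tail probability

def leave_one_out(tables):
    """P value of the overall effect after omitting each study in turn."""
    return [fixed_effect_p(tables[:i] + tables[i + 1:]) for i in range(len(tables))]

# Hypothetical counts, for illustration only (not data from the included studies).
demo_tables = [(30, 70, 45, 55), (25, 75, 30, 70), (40, 60, 38, 62), (18, 82, 35, 65)]
loo_p_values = leave_one_out(demo_tables)
print(min(loo_p_values), max(loo_p_values))
```

The minimum and maximum of the returned list correspond to the P value ranges reported for each gene-disease pair.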
Table 3. Genotype distributions of GSTM1/T1 polymorphisms among SCZ patients and healthy controls. The null genotype means homozygous loss of genes, and the present genotype includes heterozygous loss of genes and homozygous complete genes. * P value for chi-square test of genotype distribution.
Credibility of meta-analysis results. The Power and Sample Size Program 19 indicated that the total sample size had a power > 90% to detect significant associations of the GSTM1 present genotype with ATDH and SCZ (Supplementary Table S7). Moreover, the strict inclusion criteria of this meta-analysis addressed genotyping quality. For the meta-analysis of ATDH studies, the n minor for the GSTM1 null genotype was 1,329, and a grade of A was given. The I² was 36%, and a grade of B was given. After excluding a 2001 study by Roy 20 , a significant association remained between the GSTM1 present genotype and ATDH (P = 0.01), with a Harbord's test P value of 0.868 20 . For the meta-analysis of SCZ studies, the n minor for the GSTM1 present genotype was 1,654, and a grade of A was given. A grade of A was also given for the I² of 21%. After excluding a 2001 study by Harada 21 , a significant association remained between the GSTM1 present genotype and SCZ (P = 0.014), with a Harbord's test P value of 0.138. According to the Venice criteria 22,23 , "moderate" cumulative evidence supported significant associations of the GSTM1 present genotype with both ATDH and SCZ.
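The first two Venice grades reported above follow from the thresholds given in the Methods (n minor > 1,000, 100-1,000 and < 100; I² < 25%, 25-50% and > 50%). The sketch below encodes only those two criteria. The function names are hypothetical, and the treatment of exact boundary values (for example n minor = 1,000 or I² = 25%) is an assumption, because the source does not specify it.

```python
def grade_amount_of_evidence(n_minor):
    """Venice grade for amount of evidence, from the n minor thresholds in the Methods."""
    if n_minor > 1000:
        return "A"
    if n_minor >= 100:  # assumption: boundary values fall into grade B
        return "B"
    return "C"

def grade_replication(i2_percent):
    """Venice grade for replication, from the I-squared thresholds in the Methods."""
    if i2_percent < 25:
        return "A"
    if i2_percent <= 50:  # assumption: boundary values fall into grade B
        return "B"
    return "C"

# Values reported in the Results for GSTM1.
atdh_grades = (grade_amount_of_evidence(1329), grade_replication(36))
scz_grades = (grade_amount_of_evidence(1654), grade_replication(21))
print(atdh_grades, scz_grades)
```

The third criterion, protection from bias, depends on qualitative judgments and is not modeled here.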

Discussion
As noted previously, ATDH can impede TB treatment schedules and thereby increase complications and morbidity 6 . Psychiatric disorders have also been reported as an additional adverse effect of anti-TB drugs 5,7 . SCZ is a severe psychiatric disorder. Given that TB and SCZ are frequent co-morbid conditions 2,7 , understanding the molecular basis for the relationship between ATDH and SCZ would not only facilitate personalized medicine by allowing physicians to identify patients at risk of SCZ exacerbation or ATDH induction, but could also help to elucidate the molecular mechanisms common to both diseases. Until now, however, the relationship between ATDH and SCZ has been unclear, and the biological determinants common to these conditions have yet to be identified.
Previously, we proposed that GST genes might serve as a link between ATDILI and SCZ 10 . To provide a global perspective of the hidden molecular basis for the connection between ATDH and SCZ, we proposed a protein-protein interaction (PPI) network-based analysis pipeline that prioritizes possible key drivers affecting both ATDH and SCZ by extending the ATDH-related and SCZ-related gene sets to neighboring genes and identifying key causal genes that overlap in these extended gene sets. Although a direct overlap of known gene sets is the most intuitive way of exploring a genetic association between two diseases, such a direct overlap would fail to reflect the complexity of the intertwined regulation between these two diseases. Moreover, because the known disease gene sets are incomplete, a direct comparison that ignores the missing genes would be naive. In contrast, in our proposed method, we qualitatively analyzed the potential nature of a gene as a key driver of two diseases using the hypergeometric test. Furthermore, identified candidate key drivers were ranked according to P value significance and network microenvironments (i.e., potential interaction neighbors). The ability to reconstruct potential causal signaling will facilitate further molecular biology studies.
In addition, identified key drivers can provide clues about therapeutic interventions that affect genes from both diseases. We note that EnrichNet uses similar methods to determine the enrichment of one gene set in a particular pathway or other signature 24 ; these include employing information from the PPI network and extending the seed genes to neighboring genes using the RWR method. However, EnrichNet measures the significance of a relationship between a series of extended genes and a particular pathway according to different RWR distance cutoffs, whereas our method determines the best RWR distance cutoff. Specifically, we adopted the concept of the leading-edge subset used in GSEA: we sorted the neighbor gene lists by RWR distance and labeled the seed genes as positive and all other genes as negative. Using our method, the peak at which the running sum maximally deviates from zero determines the best RWR distance cutoff; subsequently, we can determine the best extended genes for either disease and study the overlaps to identify candidate key causal genes using the hypergeometric test. Therefore, EnrichNet and our analysis pipeline use similar methods in different ways to address different problems. In the present study, we used our novel analysis pipeline to identify 500 genes with P < 10⁻⁸ as possible causal genetic factors shared by ATDH and SCZ. Given the nature of ATDH, however, it is difficult to collect a sufficient number of patients with both ATDH and SCZ in the absence of other comorbidities. Because systemic meta-analysis is considered a powerful tool for identifying genes associated with a given disease, we have created, and regularly update, the online database ATDH-SCZgenes (http://www.bio-x.cn/ATDH-SCZgenes.html) to analyze and validate the association between candidate genes and both diseases.
Among the 500 evaluated candidate genes, to date, only GSTM1, CYP2E1 and GSTT1 have been found to be associated with both ATDH and SCZ; all others were associated with neither or only one of the diseases. In the analysis of CYP2E1 and ATDH, a pooled OR of 1.2 (95% CI: 0.85-1.68) was determined for the rs2031920 (−1053C>T) polymorphism, whereas a fixed-effect model yielded a pooled OR of 1.3 (95% CI: 1.06-1.59). Only one previous study has evaluated the association between CYP2E1 and SCZ, and it identified a positive association. In contrast, the SzGene database 17 and a systemic meta-analysis of SCZ GWAS data 18 failed to corroborate that association, but consistently supported a significant association of DRD2 with SCZ. DRD2 has been reported as a prominent genetic risk factor for susceptibility to severe alcoholism 25 , which is associated with liver damage. Further validation of the associations of these genes with ATDH and SCZ is needed in the future.
Through a systemic meta-analysis, we validated a significant association of the GSTM1 null genotype with increased risks of both ATDH and SCZ, and these significant results were supported by "moderate" evidence according to the Venice criteria 22 , suggesting that GSTM1 may be a causal factor shared by both ATDH and SCZ.
ATDH has been widely suggested to be a glutathione S-transferase (GST)-related disease 26,27 . GSTs comprise a superfamily of detoxification enzymes that conjugate glutathione with free radical scavengers and facilitate their elimination from the body, thereby reducing the potential toxicities of target substances 28 . These enzymes are encoded mainly by two genes: the GSTM1 gene on chromosome 1p13.3, which encodes the cytosolic GST class Mu 1 enzyme, and the GSTT1 gene on chromosome 22q11.2, which encodes the cytosolic GST class theta 1 enzyme 29,30 . Both GSTM1 and GSTT1 may harbor a null mutation comprising a complete deletion of the respective gene via unequal homologous crossover, and homozygous null mutations can lead to a variable, tissue-specific loss of GST activity 31 .
GSTM1 is mainly expressed in the liver and brain 32 , which is consistent with the significant associations observed between this gene and both ATDH and SCZ. Furthermore, GSTM1 not only detoxifies the toxic metabolites of anti-TB drugs generated by CYP2E1 in the liver, but also catalyzes the conjugation of glutathione with aminochrome and dopa-o-quinone metabolites of oxidized dopamine in the brain 32 . Reactive oxygen species are generated at high rates in the brain, and regulation of the growth and pruning of neurons is partly attributed to the redox mechanism that controls the balance between neurodestructive oxidants and neuroprotective antioxidants 33 . Therefore, GSTM1 inactivation due to the GSTM1 null genotype not only causes liver injury but also promotes the accumulation of neurodestructive oxidants and the consequent development of SCZ (Fig. 3).
Compared with our previous study 10 , the present study exhibits the following improvements: (1) the current meta-analysis or field synopsis is more comprehensive and systematic and involves as many candidate genes as possible, including GST genes; (2) the current analysis is updated regularly using the online database ATDH-SCZgenes; and (3) stricter statistical methods were used in this analysis, including ORs instead of risk ratios and a threshold P value for publication bias of 0.1 instead of 0.05. However, this study also has some limitations, including the use of the STRING database, a functional association network without direction. The availability of a disease-specific directed network might lead to more comprehensive and concrete conclusions using the current method. Additionally, the significant association of GSTM1 with SCZ should be interpreted with caution, as the P value for the publication bias test was 0.0637; this value represents evidence of small-study effects when all studies are included. However, our sensitivity analysis to assess the effects of individual studies on the overall meta-analysis estimate yielded P values for overall effects of 0.001-0.015 in the GSTM1 and SCZ analysis, indicating the stability of the positive results obtained with these analyses. Furthermore, to date, no GWAS on ATDH has been reported and the number of reported ATDH-associated genes is small; therefore, only the GSTM1 and GSTT1 genes could be validated in the current systemic meta-analysis. Validation of other causal genes must be performed in future studies.
In summary, we provide a list of possible causal genetic factors associated with both ATDH and SCZ, and have identified a shared genetic basis of these two diseases. Furthermore, we have created and will regularly update the ATDH-SCZgenes online database to validate the association of each candidate gene with either disease. Finally, GSTM1 was validated as a causal factor of both ATDH and SCZ, whereas other genes such as CYP2E1 and DRD2 will require further validation.

Methods
Ethics statement. The current research was performed in compliance with the Helsinki Declaration and was approved by the Bioethics committee of the Bio-X Institutes of Shanghai Jiaotong University. Informed consent was obtained from all subjects.

Network-based analysis. All known ATDH/ATDILI- and SCZ-related genes reported (including by GWAS) to have significant effects on the respective disease in at least one study were collected from GenBank (http://www.ncbi.nlm.nih.gov/gene/) and are listed in Supplementary Table S1. These genes were used as seed genes in the subsequent analysis. To explore the causal factors of ATDH and SCZ, a novel network-based analysis pipeline was developed per the workflow shown in Fig. 4.
Figure 4 caption (fragment): ATDH genes were expanded using RWR and the known ATDH genes as seed genes. The stop point was determined using a running sum curve reflective of the overlap between the top expanded ATDH genes and known SCZ genes. SCZ genes were expanded in a similar manner. (C) Common genes between ATDH and SCZ were highlighted for further key driver evaluation. (D) The neighbor genes of candidate key drivers were tested for overlap significance with common disease genes. These neighbors of key drivers should significantly affect the more common disease genes.
Scientific Reports | 6:32571 | DOI: 10.1038/srep32571
First, established seed genes related to ATDH or SCZ were mapped onto the highest-confidence human STRING network (version 9.1, confidence score > 0.900), which included a total of 8,823 genes 34 . These seed genes were subsequently expanded on the network using the RWR method 35 as follows: a PPI network G = (V, E) comprising a set of proteins V and a set of interactions E is represented by an n × n adjacency matrix A, where n is the number of proteins. The entry at row i and column j is set to 1 if protein i interacts with protein j; otherwise it is set to 0. First, the adjacency matrix A was normalized in a column-wise manner as follows:
$$A'[i,j] = \frac{A[i,j]}{\sum_{k=1}^{n} A[k,j]}$$
The random walker starts at a set of seed genes (e.g., known disease genes). The initial state P_0 can be formulated as a column vector
$$P_0 = (\psi_1, \psi_2, \ldots, \psi_n)^{T},$$
where ψ_i is set to 1/m for each of the m seed genes and to 0 for the other genes on the network, and n is the number of genes on the network. The random walker randomly visits adjacent genes at every step t → t + 1. The state probabilities P_{t+1} at time t + 1 are calculated as follows:
$$P_{t+1} = (1 - r)\,A'\,P_t + r\,P_0,$$
where P_t represents the state probabilities at time t and r is the restart probability (i.e., the probability of starting from the seed genes again), which was set to 0.7 as suggested by multiple previous studies 36-41 . This process was repeated until a steady state was reached; this was defined as a difference between two successive steps of < 1e-6 according to previous studies 38,42,43 .
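The RWR expansion can be illustrated with a short, dependency-free Python sketch. It follows the standard RWR update (column-normalized adjacency matrix, restart probability r = 0.7, tolerance 1e-6) on a hypothetical four-gene toy network rather than the STRING network. The function name, the toy data, and the use of the L1 norm as the convergence measure are assumptions, since the text does not state which norm was used.

```python
def random_walk_with_restart(adj, seeds, r=0.7, tol=1e-6, max_iter=10000):
    """Steady-state visiting probabilities for every node of an undirected network.

    adj   : n x n adjacency matrix (list of lists of 0/1)
    seeds : set of seed-node indices (known disease genes)
    """
    n = len(adj)
    col_sums = [sum(adj[k][j] for k in range(n)) for j in range(n)]
    # Column-wise normalization; columns of isolated nodes stay all-zero.
    a_norm = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
              for i in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        p_next = [(1.0 - r) * sum(a_norm[i][j] * p[j] for j in range(n)) + r * p0[i]
                  for i in range(n)]
        # Assumption: convergence is measured with the L1 norm of the change.
        if sum(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p

# Hypothetical toy network: a path 0 - 1 - 2 - 3 with gene 0 as the only seed.
toy_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
probs = random_walk_with_restart(toy_adj, {0})
ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(ranked)
```

Sorting genes by the resulting probabilities gives the ranked list on which the expansion cutoff is then chosen.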
After expansion, all genes on the network were assigned a disease gene probability. To determine the boundaries of gene expansion, we adapted an idea from GSEA and calculated the running sum from the top to the bottom of the ranked gene list. Specifically, when we expanded ATDH genes from the ATDH seed genes, all genes on the network were ranked according to the likelihood of being an ATDH-related gene. If we encountered a gene that was not an established SCZ seed gene, $-\sqrt{G/(N-G)}$ was added to the running sum, where N is the number of all network genes and G is the number of known established SCZ seed genes; otherwise, $\sqrt{(N-G)/G}$ was added 44,45 . Based on the running sum, a peak of network expansion from the ATDH seed genes was determined. This cutoff was then used to obtain a list for ATDH seed gene expansion. A list for SCZ seed gene expansion was generated similarly. Common genes between these two lists (i.e., overlapped genes) were studied because these might more robustly reflect the common genetic basis of the two diseases. Finally, we screened all possible causal gene factors by testing their neighbors for overlapped genes using the hypergeometric test P value 46 :
$$P = \sum_{k=m}^{\min(n,M)} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}},$$
where N is the number of all network genes, M is the number of common disease genes, n is the number of neighbor genes, and m is the number of neighbor genes that are common disease genes.
... National Knowledge Infrastructure (CNKI) and the Chinese BioMedical Literature Database were searched for studies with publication dates up to 03/31/2016 using the following keywords: ("anti-tuberculosis drug-induced liver injury", "anti-tuberculosis drug-induced hepatotoxicity", "ATDH" or "ATDILI") and ("Schizophrenia"), together with the full name or abbreviation of each candidate causal gene, including: "cytochrome P4502E1", "CYP2E1", "UDP-glucuronosyltransferase 1A6", "UGT1A6", "glutathione S-transferase M1", "GSTM1", and "glutathione S-transferase T1" or "GSTT1". The references of retrieved articles were also reviewed to identify additional relevant literature.
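For the network-based analysis, the running-sum cutoff and the hypergeometric screening of candidate key drivers can be sketched in Python as follows. The sketch reads the running-sum increments as the GSEA-style weights +sqrt((N−G)/G) for a seed gene and −sqrt(G/(N−G)) otherwise, and takes the cutoff at the maximum of the running sum; both readings, the function names and the toy gene labels are assumptions made for illustration.

```python
from math import comb, sqrt

def running_sum_cutoff(ranked_genes, reference_seeds):
    """Rank position (1-based) at which the running sum peaks.

    ranked_genes    : all network genes, sorted by decreasing RWR probability
    reference_seeds : seed genes of the other disease, counted as hits
    Requires at least one hit and at least one miss in the ranked list.
    """
    n_total = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in reference_seeds)
    hit = sqrt((n_total - g) / g)    # added when a seed gene is encountered
    miss = -sqrt(g / (n_total - g))  # added otherwise
    running, best, best_k = 0.0, float("-inf"), 0
    for k, gene in enumerate(ranked_genes, start=1):
        running += hit if gene in reference_seeds else miss
        if running > best:
            best, best_k = running, k
    return best_k

def hypergeom_pvalue(N, M, n, m):
    """P(X >= m): probability that at least m of n neighbors are among the M common genes."""
    total = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, min(n, M) + 1))
    return total / comb(N, n)

# Hypothetical toy ranking of ten genes; g1, g2 and g4 play the role of seed genes.
toy_ranked = ["g%d" % i for i in range(1, 11)]
cutoff = running_sum_cutoff(toy_ranked, {"g1", "g2", "g4"})
expanded = toy_ranked[:cutoff]

# Toy key-driver test: 10 network genes, 3 common genes, 4 neighbors, 2 of them common.
p_value = hypergeom_pvalue(N=10, M=3, n=4, m=2)
print(cutoff, expanded, p_value)
```

With these weights the running sum returns to zero at the end of the list, so its maximum marks the point where seed genes stop being enriched near the top of the ranking.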
Inclusion and exclusion criteria. Articles included in the meta-analysis complied with the following
Data extraction. Data extraction was performed independently by two reviewers using a standardized protocol and reporting form. Discrepancies between the two reviewers were resolved by further discussion with a third party. For overlapping studies, the study with the larger sample size was retained for the meta-analysis. The recorded study characteristics included: (1) the first author's name, (2) publication year, (3) sample ethnicity, (4) control and case characteristics, (5) methods used for genotyping and (6) target genes.
Statistical analysis. The strengths of the associations between gene polymorphisms and the risk of ATDH or SCZ were measured using ORs with corresponding 95% CIs. Pooled ORs were calculated for null vs. present genotype comparisons. If the total number of studies was < 20, the random-effect model of meta-analysis was used to calculate the pooled ORs according to the DerSimonian-Laird method; if the number was ≥ 20, the fixed-effect model was applied according to the Mantel-Haenszel method 48,49 . Inter-study heterogeneity was assessed using the chi-square-based Q-test (Cochran's Q statistic), and a strict P value < 0.1 was considered statistically significant 14 . I² values and corresponding 95% CIs were also calculated to describe the percentage of variability in the effect estimates that was attributable to heterogeneity rather than sampling error; an I² > 50% was roughly considered to indicate substantial heterogeneity 50 . The I² formula uses Q as well as its degrees of freedom (df) 50,51 :
$$I^2 = \frac{Q - df}{Q} \times 100\%$$
A sensitivity analysis, in which one study at a time was removed prior to analysis, was conducted to evaluate whether a single study would significantly affect the results. This analysis used a model other than the model used to calculate the pooled ORs. Harbord's test was used to test for small-study effects, to which publication bias might be a contributor, and a P value < 0.1 was considered representative of statistically significant publication bias 52 . All statistical analyses were implemented in Review Manager 5.2 (The Nordic Cochrane Centre, Copenhagen, Denmark) and Stata version 11.2 (Stata Corporation, College Station, TX, USA).
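As an illustration of the pooling step, the sketch below computes a DerSimonian-Laird random-effect pooled OR together with Cochran's Q and I² from 2 × 2 genotype counts. It is a minimal reimplementation for exposition, not the Review Manager/Stata procedure used in the study; the function name is hypothetical, the counts are invented, and zero cells are assumed not to occur (no continuity correction is applied).

```python
from math import exp, log, sqrt

def pooled_or_random_effects(tables):
    """DerSimonian-Laird pooled odds ratio with Cochran's Q and I-squared.

    Each table is (a, b, c, d): cases with/without the genotype,
    then controls with/without the genotype. Assumes no zero cells.
    """
    y = [log((a * d) / (b * c)) for a, b, c, d in tables]   # log ORs
    v = [1 / a + 1 / b + 1 / c + 1 / d for a, b, c, d in tables]  # their variances
    w = [1.0 / vi for vi in v]                               # inverse-variance weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(tables) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0          # between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]
    y_re = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0    # truncated at 0%
    return {
        "or": exp(y_re),
        "ci_low": exp(y_re - 1.96 * se),
        "ci_high": exp(y_re + 1.96 * se),
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }

# Hypothetical counts, for illustration only (not data from the included studies).
demo = pooled_or_random_effects(
    [(30, 70, 45, 55), (25, 75, 30, 70), (40, 60, 38, 62), (18, 82, 35, 65)]
)
print(demo)
```

When the estimated between-study variance is zero, the random-effect weights reduce to the inverse-variance fixed-effect weights, so the two models give the same pooled estimate.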

Credibility of meta-analysis results. A power analysis was performed using the Power and Sample Size
Program with α = 0.05 as the level of significance; effect sizes were estimated from the meta-analyses 19 . To assess the noteworthiness of an association, the false-positive report probability (FPRP) 34 was estimated using an FPRP threshold of 0.2 and prior probabilities of 0.05 to 10⁻⁶ . Cumulative evidence for genetic associations of the GSTM1 and GSTT1 present genotypes with ATDH and SCZ was assessed according to the Venice interim criteria, which include the amount of evidence, replication of results and protection from bias 22 . Regarding the amount of evidence, grades of A, B, and C were given for an n minor > 1,000, 100-1,000 and < 100, respectively, where n minor refers to the total number of cases and controls with the least frequent genotype. Regarding replication, grades of A, B, and C were given for I² values < 25%, 25-50% and > 50%, respectively. Regarding protection from bias, any of the following criteria should be met: (1)