Accumulation of Minor Alleles of Common SNPs in Schizophrenia

Schizophrenia is a common neuropsychiatric disorder with a lifetime risk of 1%. A number of large scale genome wide association studies have identified numerous individual risk single nucleotide polymorphisms (SNPs) whose precise roles in schizophrenia remain unknown. Accumulation of many of these risk alleles has been found to be a more important risk factor. Consistently, recent studies showed a role for enrichment of minor alleles (MAs) in complex diseases. Here we studied the role of MAs in general in schizophrenia using public datasets. Relative to matched controls, schizophrenia cases showed higher minor allele content (MAC), especially for the sporadic cases. By linkage analysis, we identified 82 419 SNPs that could be used to predict 2.2% schizophrenia cases with 100% certainty. Pathway enrichment analysis of these SNPs identified 17 pathways, 15 of which are known to be linked with Schizophrenia with the remaining 2 associated with other mental disorders. These results suggest a role for a collective effect of MAs in schizophrenia and provide a method to genetically screen for schizophrenia. Abbreviations MAs minor alleles MAC minor allele content MAF minor allele frequency AUC under the curve TPR True positive rate


Introduction
Schizophrenia is one of the most frequent neuropsychiatric disorders with a lifetime risk of 1% in the general population (McGrath et al., 2008;McGrath, 2007). This disease is often chronic and places a great burden on family and society. It is characterized by the occurrence of delusions, hallucinations, disorganized speech and behavior, impaired cognition, and mood symptoms (van Os and Kapur, 2009). Data from twin, family, and adoption studies showed strong evidence that schizophrenia is predominantly a genetic disorder with high heritability (Sullivan et al., 2003).
The precise mode of Schizophrenia inheritance is unclear and risk prediction using known genetic components is presently unrealistic. Based on investigating familial syndromes with schizophrenia-like phenotypes, two rare variants have been identified as associated with schizophrenia: the 22q11 deletion (Ivanov et al., 2003;Karayiorgou et al., 1995;Sporn A Fau -Addington et al., 2004) and a 1:11translocation (Blackwood et al., 2001;Hodgkinson et al., 2004). With the advent of copy number variants (CNVs) microarray technology, an increasing number of large rare deletions have been detected in schizophrenia patients (Levinson et al., 2011;Moreno-De-Luca et al., 2010;Walsh et al., 2008). However, the effect size associated with common CNVs is smaller than initially estimated (Wray and Visscher, 2010). In addition, many candidate genes for schizophrenia have been found by Genome-wide association studies (GWAS) (O'Donovan et al., 2008;Schizophrenia Psychiatric Genome-Wide Association Study, 2011). However, these SNPs are at frequencies of 20-80% in the general population and only account for a minimal increase in risk (Mulle, 2012;Tiwari et al., 2010). It is likely that schizophrenia may be related to accumulation of many risk alleles at thousands of loci (International Schizophrenia et al., 2009).
An allele can belong to either the major or the minor allele according to its frequency in the population and the minor allele (MA) has frequency (MAF) < 0.5. Most known risk alleles are MAs (Park et al., 2011). Our previous studies have shown that the collective effects of MAs may play a role in numerous traits and diseases (Yuan et al., 2014;Zhu et al., 2015a;Zhu et al., 2015b). Specifically, enrichment of genome wide common SNPs or MAs is associated with Parkinson's disease (Zhu et al., 2015b) and lower reproductive fitness in C.elegans and yeasts (Zhu et al., 2015a). We here studied the role of genome wide MAs as a collective whole in schizophrenia using previously published GWAS datasets.

Materials and Methods: 2.1 Subjects
Two GWAS datasets of Cases and controls (phs000021.v3.p2, phs000167.v1.p1 (International Schizophrenia et al., 2009;O'Donovan et al., 2008;Stefansson et al., 2009;Suarez et al., 2006) were downloaded from database of Genotypes and Phenotypes (dbGaP). All subjects we selected for analysis are European-American ancestry population. The SNPs of all subjects were genotyped using AFFY_6.0 in genome-wide. Principal component analysis (PCA) using the GCTA tool was performed to analyze the genetic homogeneity of the subjects (Yang et al., 2011). Outliers were excluded through selection of the principal component values (Supplementary Table S1). Duplicated subjects were excluded, and the parents of cases were also excluded.

SNPs selection
All SNPs for analysis in this study are autosomal SNPs. In addition, we excluded SNPs showing departure from the Hardy-Weinberg equilibrium (P < 0.01), with missing data < 5%, and with MAF< 10 -4 . After these filters, there were 512 673 SNPs remaining (Table 1).

Statistical analysis
The Hardy-Weinberg equilibrium, missing data, MAF and logistic regression analysis were performed using PLINK Tools (Purcell et al., 2007). MAC per subject means the ratio of the total number of MAs divided by the total number of SNPs scanned (non-informative NN SNPs were excluded). The script for MAC calculation was previously described (Zhu et al., 2015b). Risk coefficient of each SNP was calculated with logistic regression test (equal to coefficient logistic regression test). The weighted risk score of a MA was calculated as follows: for homozygous MA, the risk coefficient was 1 x the coefficient, for heterozygous MA, it was 0.5 x the coefficient, for homozygous major allele, the coefficient was 0. The total weighted risk score from all MAs in a subject was obtained by summing up the weighted risk coefficient of all MAs by the script as described previously (Zhu et al., 2015b).

Genetic risk models construction and evaluation
125 prediction models were obtained from different combinations of MAF and p-value (Zhu et al., 2015b). For external cross-validation, the phs000021.v3.p2 study was used as training dataset, and the phs000167.v1.p1 study as validation dataset. Receiver operating characteristic (ROC) curves were used to describe the ability to differentiate cases and controls. True positive rate (TPR) is the proportion of cases with weighted risk scores higher than all of the controls. Area under the curve (AUC) and the TPR were calculated for each model by prism5.
For the internal cross-validation, a 10 fold cross-validation was used to test the models with good performance in external cross-validation. The models with TPR > 2% and AUC > 0.58 were chosen for internal cross-validation. Subjects in phs000021.v3.p2 were divided into 10 sub-sets randomly. When a sub-set was used as the validation data, the other 9 sub-sets were used as the training data. The cross-validation process was repeated 10 times, and the mean AUC and TPR values were calculated from these 10 results.

Collective effects of minor alleles in Schizophrenia
We made use of the published GWAS datasets (phs000021.v3.p2 and phs000167.v1.p1). We first cleaned these datasets by removing outliers in principle component analysis (PCA) plots (see Figure in Supplementary Table S1). The cleaned datasets contains 1 003 cases and 1 152 controls in phs000021.v3.p2 dataset, and 828 cases and 1 068 controls in phs000167.v1.p1 dataset (Table 1). MA status of each SNP was then obtained by using the control cohort using MAF < 0.5 as cutoff. MAC of each subject was calculated, and the mean MAC of cases and controls was compared. The results showed that the mean MAC of schizophrenia cases is significantly higher than that of controls in both the phs000021.v3.p2 data (mean MAC [mean ± stdev], cases vs controls is 0.2235 ± 0.0010 vs 0.2233 ± 0.0011, P = 2.71E-06, t test) and the phs000167.v1.p1 data (cases 0.2251± 0.0011 vs controls 0.2249 ± 0.0011, P = 9.21E-04, t test, Supplementary Table S2). MAC values of both cases and controls showed normal distribution but cases were shifted slightly to the right or higher MAC values ( Figure 1A and B).
To study of the role of MAC in sporadic versus familial cases of schizophrenia, we further combined the subjects in the two datasets and calculated the MAC of each subject again. Then we compared the mean MAC of sporadic schizophrenia cases (n = 1217) to that of cases with family history of a psychotic illness (n = 493). We also compared these two groups of cases with the controls (n = 2220). The results showed that the mean MAC of sporadic schizophrenia cases (cases1) was slightly higher than cases with family history (cases2) (cases1 0.221505 ± 0.001039 vs cases2 0.221503 ± 0.001044， P = 0.49, one-way ANOVA). The MAC difference between sporadic cases and controls was significant (cases1 0.221505 ± 0.001039 vs controls 0.221412 ± 0.001054, P = 6.4E-03, one-way ANOVA), and more so than that between cases with family history and controls (cases2 0.221503 ± 0.001044 vs controls 0.221412 ± 0.001054, P = 0.04, one-way ANOVA, Supplementary Table S2). The results confirmed the expectation that collective effects of minor alleles should play more important role in sporadic Schizophrenia cases because familial cases may involve major effect mutations in a small number of genes.
We also calculated a risk coefficient score for each SNP by logistic regression analysis and obtained a weighted risk score based on the MA status and the risk coefficient score as previously described (Zhu et al., 2015b). The MAC of each individual was then converted into a weighted risk score by summing up the weighted risk scores of each SNP. The mean weighted risk score of cases was found to be far greater than that of controls in both datasets (cases 290.43 ± 55.86 vs controls -253.59 ± 57.90 in phs000021.v3.p2, P = ~0; cases 294.37 ± 50.77 vs controls -188.19 ± 50.50 in phs000167.v1.p1, P = ~0, t-test, Supplementary Table S2). This was apparent on a distribution plot of the weighted risk score with clearly separated cases and controls ( Figure 1C and D).

Evaluation of risk prediction models
In order to get an optimal MAs model or a subset of MAs for risk prediction, we divided the MAs into 5 groups according to MAF (<0.5, <0.4, <0.3, <0.2, and <0.1, Fig 2). We performed logistic regression analysis and obtained the p-values for each SNP. Based on these p-values, we divided each group into 25 subgroups and obtained a of 125 prediction models (Fig 2, Supplementary Table 3). We then performed external cross-validation and internal cross-validation analyses to test these models. In external cross-validation, we used phs000021.v3.p2 as the training dataset and phs000167.v1.p1 as the validation dataset. We then used the receiver operator characteristic (ROC) curve to examine the discriminatory capability or area under the curve (AUC) of each model in the testing dataset. We found 17 models with AUC > 0.58 and true positive rate (TPR) > 2%. Among these models, the best TPR is 2.78%, and the best AUC is 0.6 (Fig 2 and Supplementary Table S3).
We next tested whether the set of 82 419 SNPs is relatively specific to schizophrenia. We compared these SNPs with the previously identified 37 564 SNPs specific for Parkinson's disease (Zhu et al., 2015b). Only 1 239 SNPs were found shared between these two sets, indicating that different diseases may be linked with different sets of SNPs.

SNPs annotation
We next examined the potential functions of the 82 419 SNPs by annotating them with the ANNOVAR software (Wang et al., 2010). There were 82 834 SNPs annotation results in total due to the fact that some SNPs may lie in between two genes and could hence generate two annotation results. We found 0.956% of SNPs in exonic regions (Table 2, Supplementary Table S4).
We mapped the 82 419 SNPs to gene loci using WebGestalt tools, and found 6 588 genes (Supplementary Table S4). These genes were characterized using Gene Ontology in WebGestalt according to biological process, molecular function, and cellular component. As shown in Table 3, most of these genes were related to cytoskeletal proteins, phospholipid, anion and actin biding, GTPase regulators, transmembrane receptor protein tyrosine kinases, transmembrane receptor protein kinases, small GTPase regulators, and mental ion transmembrane transporter activities, and nucleoside-triphosphatase regulators.

Discussion
In this study, we showed enrichment of MAs in schizophrenia cases relative to matched controls. We also identified a set of 82 419 SNPs that can predict with certainty a fraction of schizophrenia cases. These results are consistent with previous work on other complex diseases and traits (Yuan et al., 2014;Zhu et al., 2015a;Zhu et al., 2015b).
There were reports of male bias in schizophrenia (Aleman et al., 2003). The ratio of males to females in cases of phs000021.v3.p2 data was 2.28 but was close to 1 in the other dataset. We however did not observe significant differences in MAC values between male and female cases in both datasets. Thus, MAC may play a similar role in both sexes.
Recent studies have shown that a much larger than expected portion of the human genome may be functional (Fung et al., 2014;Hu et al., 2013;Sauna and Kimchi-Sarfaty, 2011). Genetic diversities are at saturation levels as indicated by the observation that higher fractions of fast evolving SNPs are shared between different human groups (Yuan et al., 2017). This raises the question of what selection forces are keeping genetic diversity levels from increasing with time. By linking the total amount of SNPs or MAs in an individual to complex diseases and traits, it is clear that complex diseases could serve as a negative selection mechanism to prevent abnormal increase in SNP numbers in an individual. It is intuitively obvious that the overall property of the genome as a whole should be linked with the wellbeing of an organism. Our results here on schizophrenia further confirmed the hypothesis we put forward before that a highly complex and ordered system such as the human brain must have an optimum limit on the level of randomness or entropy in its building parts or DNAs (Zhu et al., 2015b).
It has been difficult to use any genetic markers or combinations of them to predict risk of schizophrenia. We here identified a set of 82 419 SNPs that could predict 2.2% cases with 100% specificity. Although this is still a low percentage, it may still prove valuable for prenatal diagnosis of schizophrenia. The set of 82 419 SNPs specific for schizophrenia was highly linked with pathways known to be involved in the disease, thereby validating our method of looking for disease specific set of SNPs. This set is much larger than any known from previous studies (International Schizophrenia et al., 2009). This large collection of risk alleles is not unexpected if most genome sequences are functional as explained by the maximum genetic diversity (MGD) theory (Huang, 2008;Huang, 2009;Huang, 2016), which inspired this work in the first place. Future studies using larger sample sizes may help identify a more specific set of risk SNPs that could predict higher fraction of cases.

Contributors
Shi Huang and Zuobin Zhu designed the study and wrote the protocol. Pei He, Xiaoyun Lei and Dejian Yuan managed the literature searches and analyses. Pei He undertook the statistical analysis and wrote the first draft of the manuscript. All authors contributed to and have approved the final manuscript.

Conflict of interest
The authors declare that they have no conflict of interest.  O: the number of genes in our gene set and also in the category; E: the expected gene number in the category; R: ratio of enrichment; P: p value adjusted by the multiple test adjustment. MAC：Minor allele content of SNPs with MAF < 0.5.

Figure 2. Discriminatory ability of different prediction models.
SNPs were divided into 5 groups based on MAF, each group was further divided into 25 subgroups based on p-values from the logistic regression test and 125 prediction models were obtained. AUC (A) and TPR (B) were calculated using a training dataset and a validation dataset to evaluate the discriminatory ability.