Introduction

In recent years, several genomic loci have been reproducibly associated with schizophrenia.1 However, the underlying causative mutations have remained elusive.2 Owing to the absence of coding disease mutations, an important role of noncoding variants can be hypothesized. To address this hypothesis, we applied a sequencing strategy to search for noncoding mutations among affected individuals. As complete noncoding regions are prohibitively large, we focused on gene control regions, that is, fragments located around transcription start sites. This choice was made under the assumption that functional noncoding variants may affect the regulation of gene expression. Within gene control regions, preference was given to genomic regions that are highly conserved between human and rodents. Studies of human single nucleotide polymorphisms (SNPs) showed that these regions display a signature of purifying selection,3 which makes them first-choice targets in the search for disease mutations. Although some examples of phenotypic effects of sequence variation in conserved noncoding regions exist,4 to our knowledge no sequencing study so far has focused on the nucleotide diversity of these regions among individuals affected with a complex disease phenotype.

In the present study, we sequenced 27 kb of genomic DNA from six schizophrenia-associated gene loci in 37 schizophrenic and 25 healthy individuals. Assuming that variants of higher penetrance are more likely to lead to several affected family members, we included only patients with at least one affected first-degree relative. We targeted fragments belonging to the gene loci neuregulin-1 (NRG1), dystrobrevin binding protein-1 (DTNBP1), regulator of G-protein signaling-4 (RGS4), dopamine receptor-3 (DRD3), RAC-alpha serine/threonine-protein kinase (AKT1) and brain-derived neurotrophic factor (BDNF). The schizophrenia-associated gene loci NRG1 (refs 5, 6, 7, 8, 9), DTNBP1 (refs 10, 11, 12, 13, 14) and RGS4 (refs 15, 16, 17, 18) encode proteins that play a role in synaptic function. Among other functions, NRG1 participates in glutamatergic signaling by regulating the N-methyl-D-aspartate (NMDA) receptor through the interaction of the NRG1 protein and its receptors.5 DTNBP1 also appears to influence schizophrenia risk through effects on glutamate function.19, 20 RGS4 appears to modulate the activity of some serotonergic and metabotropic glutamatergic receptors,21 whereas its own expression is modulated by dopaminergic signaling.22 The involvement of the dopamine receptor DRD3 in schizophrenia is supported by a meta-analysis of association studies23 and by its role as a drug target. Supported by similar lines of evidence is AKT1, which is a target of lithium and is associated with schizophrenia in several populations.24, 25, 26 The potential role of BDNF in schizophrenia has recently been reviewed,27 and it was reported to increase the genetic risk for schizophrenia in a Scottish population.28 All these associations suggest the existence of common variants underlying schizophrenia at the respective gene loci. However, these associations also make these loci excellent candidates in the search for rare mutations that could play a role in the etiology of schizophrenia.
To measure the relative excess of rare, presumably deleterious, variants among schizophrenia patients, we use a summary statistic based on Tajima's D.29

Materials and methods

Patient samples

All individuals included in the study were of German descent and were ascertained at the Department of Psychiatry at the University of Bonn. Written informed consent was obtained from all patients and controls. All patients had been interviewed by experienced psychiatrists using the Structured Clinical Interview for DSM-IV Disorders. Lifetime ‘best estimate’ diagnoses according to DSM-IV criteria were based on multiple sources of information, including personal structured interview (SCID I), medical records, and family history method. Consensus diagnoses were made by two psychiatrists and, whenever necessary, additional psychiatrists were included in the decision process. The control group was matched by age, sex and ethnic origin. All samples were checked for good DNA quality prior to the sequencing experiments.

DNA sequencing and sequence annotation

Double-stranded DNA sequencing using the chain-termination method30 was performed commercially (AGOWA, Berlin, Germany). Noncoding regions were targeted for sequencing if they were annotated in the Ensembl database (version 17.33) as highly conserved between human and rodents (Mus musculus or Rattus norvegicus) and were located within 10 kb upstream or downstream of the transcription start site. Conserved regions were extended into the flanking regions in both directions so that fragments of about 500 bp were targeted for sequencing.

Evaluation of sequencing traces was performed with the software packages Phred and Phrap,31 Polyphred32 and Consed.33 It has been shown that Phred's base calling is highly accurate.31 To minimize base calling errors, a Phred quality score of 30, that is 99.9% accuracy of the base call, was used for polymorphism detection. All SNPs detected by Polyphred were manually checked by two human experts to detect false positive SNPs. In addition, a SNP had to be observed in both forward and reverse reads to be considered as a true polymorphism.
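For reference, a Phred quality score Q is related to the estimated base-calling error probability P by Q = −10 log10 P, so that a score of 30 corresponds to an error probability of 0.001. The conversion can be sketched in Python as follows; the function names are illustrative and are not part of the Phred software.

```python
from math import log10


def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to a base-calling error probability p."""
    return -10 * log10(p)


# A score of 30 corresponds to an error probability of about 0.001,
# that is, a base-call accuracy of about 99.9%.
accuracy_at_q30 = 1 - phred_to_error_probability(30)
```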

Consensus sequences of sequenced fragments were aligned to the human genome using Megablast,34 allowing for the inclusion of human genome annotations into the data analysis process. Data analysis was based on the Ensembl database (ftp://ftp.ensembl.org/pub/release-27/multi-species-27.1/data/mysql/ensembl_compara_27_1, ftp://ftp.ensembl.org/pub/release-28/human-28.35a/data/mysql/homo_sapiens_core_28_35a). Information on noncoding conservation was read from the database tables genomic_align.txt.table and method_link.txt.table. Highly conserved regions are those identified by the BLASTZ-NET TIGHT method. This definition of highly conserved regions is based on a score that considers an estimate of the neutral mutation rate at the particular locus and roughly corresponds to at least 80% sequence identity over at least 100 bp.

Nucleotide diversity estimates

To quantify sequence variability, we separately estimated for each sample the two nucleotide diversity parameters π and θ.35 Average nucleotide heterozygosity π is the expected number of nucleotide differences per site between two randomly selected sequences from a population. This can be estimated from the differences d per nucleotide site over all sequence pairs i,j in a sample of n sequences:

$$\hat{\pi} = \frac{2}{n(n-1)} \sum_{i<j} d_{ij}$$

The level of nucleotide polymorphism θ is the proportion of polymorphic nucleotide sites that are expected to be observed in a population sample. This can be estimated from the proportion of polymorphic nucleotide sites S in a sample of n sequences:

$$\hat{\theta} = \frac{S}{a_n}, \qquad a_n = \sum_{k=1}^{n-1} \frac{1}{k}$$

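Furthermore, Tajima's D statistic29 was calculated for the different sequence categories. Tajima's D statistic is the normalized difference between the estimates of π and θ:

$$D = \frac{\hat{\pi} - \hat{\theta}}{\sqrt{\widehat{\mathrm{Var}}(\hat{\pi} - \hat{\theta})}}$$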

A negative Tajima's D-value indicates an excess of rare alleles, whereas more positive values indicate an excess of intermediate frequency variants. Calculations were performed by scripts that were implemented by one of the authors (JF) at the University of Bonn as part of a larger software package that had been developed for the analysis of sequencing data in the context of an annotated human genome reference sequence.
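The three statistics can be illustrated with a minimal Python sketch. It is not the software used in this study. It assumes biallelic SNPs summarized by their minor allele counts, takes n as the number of sampled sequences (chromosomes) and uses the standard variance constants of Tajima's D. The function name and its arguments are hypothetical.

```python
from math import sqrt


def diversity_stats(minor_allele_counts, n, length):
    """Return (pi, theta, D) for a set of biallelic SNPs.

    minor_allele_counts: copies of the minor allele at each SNP
    n: number of sampled sequences (chromosomes)
    length: number of sequenced nucleotide sites
    """
    if n < 4:
        raise ValueError("the variance of D is zero for n < 4")
    # Keep only sites that are polymorphic in this sample.
    counts = [c for c in minor_allele_counts if 0 < c < n]
    s = len(counts)  # number of segregating sites
    pairs = n * (n - 1) / 2
    # Mean number of differences over all sequence pairs.
    k = sum(c * (n - c) for c in counts) / pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    pi = k / length          # per-site heterozygosity
    theta = s / a1 / length  # per-site level of polymorphism
    if s == 0:
        return pi, theta, float("nan")
    # Standard constants for the variance of (k - S/a1).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta, d
```

In this sketch, a sample in which all SNPs are singletons yields a negative D, whereas a sample in which all SNPs have intermediate frequency yields a positive D.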

Results

Overall analysis of nucleotide diversity

In total, we observed 108 SNPs within the resequenced fragments, 60% of them unknown to dbSNP (version 123). This fraction of unreported SNPs is quite large, when compared to the fraction of SNPs discovered in coding regions on independent datasets and taking into account the strong increase in public SNP database size.3 Potential explanations are the poorer database representation of noncoding SNPs or an excess of disease-related variants in the sequenced sample.

To compare the level of nucleotide diversity between the case and control sample and account for sample size, we next separately estimated for each sample the average heterozygosity π, level of polymorphism θ and Tajima's D statistic. Tajima's D is the normalized difference between π and θ, with smaller values denoting the relative excess of rare variants and higher values denoting the relative excess of common variants (see Materials and methods for details). Average heterozygosity π displayed similar values in the schizophrenia cases and controls (cases: 7.99 × 10⁻⁴; controls: 7.95 × 10⁻⁴). However, the level of polymorphism θ was higher among affected individuals (cases: 7.64 × 10⁻⁴; controls: 6.83 × 10⁻⁴), pointing to a relative increase of polymorphic sites in this sample. The consequently smaller Tajima's D among affected individuals (cases: 0.16; controls: 0.58) further indicates more rare variants among cases than among controls.

To assess the significance of this difference in Tajima's D between controls and cases, we randomly permuted the affection status of sequenced individuals. In 2400 of 100 000 random permutations, we found a stronger decrease of Tajima's D in the case sample than in the observed data (P=0.024, one-sided test for smaller Tajima's D among patients). Thus, the empirical significance analysis indicates that the observed difference of Tajima's D between controls and cases could go beyond random fluctuations. We feel that the one-sided test for an excess of rare variants in the patient sample is justified here, because one may assume that disease-causing mutations are often deleterious and that deleterious mutations are often rare in a population. The widespread existence of rare deleterious mutations in coding regions of the human genome has been shown by several earlier sequencing studies.36, 37, 38
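The permutation procedure can be sketched as follows. This is a minimal illustration, not the implementation used in this study. The function name is hypothetical, and `statistic` stands for any function that maps a group of individuals to a summary value such as Tajima's D. Permutations are counted only when the difference is strictly smaller than the observed one, which is an assumption about how "stronger decrease" was defined.

```python
import random


def permutation_p_value(cases, controls, statistic, n_perm=10000, seed=0):
    """One-sided P-value for a smaller statistic in cases than in controls.

    cases, controls: lists of individuals (any per-individual data)
    statistic: function mapping a list of individuals to a number
    """
    rng = random.Random(seed)
    observed = statistic(cases) - statistic(controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the affection status, keeping group sizes fixed.
        rng.shuffle(pooled)
        diff = statistic(pooled[:n_cases]) - statistic(pooled[n_cases:])
        if diff < observed:  # stronger decrease than actually observed
            hits += 1
    return hits / n_perm


def mean(values):
    """Toy statistic used only to demonstrate the call."""
    return sum(values) / len(values)


p = permutation_p_value([0, 0, 0, 0, 0], [10, 10, 10, 10, 10], mean, n_perm=1000)
```

In the toy call, no reassignment can produce a difference more extreme than the observed one, so the estimated P-value is 0. With real data, `statistic` would recompute Tajima's D from the genotypes of the individuals assigned to each group.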

We next repeated the estimation of nucleotide diversity parameters 1000 times with 25 randomly picked individuals from the schizophrenia sample. The average estimates for π, θ and Tajima's D (Table 1) were close to those estimated for the full sample of 37 schizophrenia patients. In none of the 1000 random schizophrenia subsamples did we observe a larger Tajima's D-value than in the equally sized control sample. This further indicates that the observed differences result neither from the different sizes of the case and control samples nor from only a small subset of case individuals.
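The subsampling step can be sketched along the same lines. The function names are hypothetical, and `statistic` again stands for any function that maps a group of individuals to a summary value such as Tajima's D.

```python
import random


def subsample_statistic(cases, size, statistic, n_rep=1000, seed=0):
    """Evaluate statistic on n_rep random subsamples of the given size."""
    rng = random.Random(seed)
    # rng.sample draws individuals without replacement.
    return [statistic(rng.sample(cases, size)) for _ in range(n_rep)]


def fraction_above(values, reference):
    """Fraction of subsample values strictly larger than a reference value."""
    return sum(1 for v in values if v > reference) / len(values)
```

With 37 cases and a subsample size of 25, `fraction_above(values, control_value)` gives the proportion of subsamples whose statistic exceeds the value observed in the control sample.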

Table 1 Overall nucleotide diversity of noncoding regions

Nucleotide diversity of noncoding conserved regions

Following human genome annotations provided by the Ensembl database (version ensembl core 28.35a, ensembl compara 27.1),39 we categorized the resequenced noncoding regions into those highly conserved between humans and rodents and those not fulfilling this criterion. These conserved regions have experienced reduced evolutionary change over the past 80 million years, due to chance, a lower mutation rate or evolutionary constraint. We found that the 37 SNPs located in those regions were more often rare SNPs (minor allele frequency <5%) than SNPs not located in such regions (χ²=13.8, df=2, P=0.001) (Figure 1). The average minor allele frequency of SNPs located in conserved regions was significantly lower both in the case (P<0.001 by two-sided Mann–Whitney U-test) and the control sample (P=0.004 by two-sided Mann–Whitney U-test), pointing to the influence of purifying selection on the frequency of alleles in conserved noncoding regions both among patients and among controls.

Figure 1

Percentage of SNPs from different frequency categories (minor allele frequency: rare <5%, infrequent <20%, common ≥20%) that were observed in noncoding conserved (black bars) and noncoding nonconserved (grey bars) regions.

Within conserved regions, we observed a markedly decreased diversity, as measured by π and θ, both in the case and the control sample (Table 1). Interestingly, both among affected and unaffected individuals, Tajima's D was negative within conserved regions (cases: −1.22; controls: −0.82), whereas it was positive within nonconserved regions (cases: 0.86; controls: 1.11). The positive Tajima's D of the presumably neutrally evolving nonconserved regions could result from population demographic events. The negative Tajima's D of the conserved regions supports purifying selection as one cause of interspecies sequence conservation. Consistent with disease susceptibility as a reason for purifying selection, Tajima's D was smaller in the case sample both in conserved and in nonconserved noncoding regions. However, the case–control comparison did not reach nominal significance when restricted to only one of these sequence categories, presumably due to reduced statistical power resulting from the smaller number of SNPs in each subcategory.

Nucleotide diversity of single gene loci

Diversity estimates of the individual gene loci were consistent with the results jointly observed over all loci (Table 2). Estimates of π and θ were above average at the loci RGS4, NRG1 and AKT1, whereas they were below average at DRD3, DTNBP1 and BDNF. A higher diversity was measured by both π and θ in the case sample at each locus except DRD3. As measured by Tajima's D, rare variants were more common among cases at each gene locus except DTNBP1. This corresponds to a nonsignificant trend towards smaller Tajima's D-values among cases (P=0.086, one-sided Wilcoxon test of two related samples). To quantify the contribution of single gene loci to the overall difference between controls and cases, we separately repeated the permutation analysis for each gene locus. We measured the strength of the signal at each locus as the Z-score of the observed difference of Tajima's D between controls and cases relative to the distribution of this difference in 100 000 randomly permuted samples (Table 2). The difference in the randomly permuted samples was approximately normally distributed with a mean close to zero for all six loci. As the observed decrease of Tajima's D in the case sample did not belong to the most extreme 5% at any locus, nominally significant differences were not reached at any single gene locus.
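The per-locus Z-score can be sketched as the observed difference standardized by the mean and standard deviation of the permuted differences. The function name is hypothetical, and the use of the population standard deviation is an assumption that matters little for 100 000 permutations.

```python
from statistics import mean, pstdev


def permutation_z_score(observed_diff, permuted_diffs):
    """Z-score of an observed difference on its permutation distribution."""
    mu = mean(permuted_diffs)
    sd = pstdev(permuted_diffs)
    if sd == 0:
        raise ValueError("permutation distribution has zero variance")
    return (observed_diff - mu) / sd
```

The approximately normal shape of the permutation distribution reported above is what makes such a Z-score a meaningful measure of signal strength.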

Table 2 Noncoding nucleotide diversity at each investigated gene locus

An interesting question is which loci contribute to the significant difference that is observed across all loci. We approached this task by modifying the previously developed SNP set association strategy40 to search for the subset of gene loci that produces the strongest difference between the two samples. This adapted strategy, which might be denoted ‘locus set association’, first ranks the gene loci with respect to the strength of the observed difference in Tajima's D between controls and cases, as measured by the Z-scores calculated above. Then, increasing numbers of loci from this ranked list are included in the case–control comparison, that is, the two samples are compared by permutation analysis at the top locus only, at the top two loci and so on up to the full set of gene loci. This gives a list of P-values referring to increasing numbers of gene loci and measuring the strength of the case–control signal for the respective subsets of loci. The smallest P-value from this list, pmin, indicates the particular subset of loci that jointly produces the strongest difference between the two samples (Figure 2, Supplementary Table 1). However, because this particular combination of gene loci was selected to give the smallest P-value, pmin should not be interpreted as an error probability. The list of loci jointly producing the strongest signal in our data contains the genes AKT1, BDNF, RGS4 and potentially NRG1, whereas the inclusion of DTNBP1 and DRD3 seems to weaken the overall signal. Thus, rare alleles preferentially located at AKT1, BDNF and RGS4 might contribute to the genetic etiology of schizophrenia in our patient sample.
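The ranking and nested-subset search described above can be sketched as follows. The function and argument names are hypothetical. The sketch assumes that the Z-scores are oriented so that larger values denote a stronger signal, and that `subset_p_value` is a user-supplied function that runs the permutation analysis restricted to a given subset of loci.

```python
def locus_set_association(z_scores, subset_p_value):
    """Search nested subsets of ranked loci for the smallest P-value.

    z_scores: dict mapping locus name to signal strength (larger = stronger)
    subset_p_value: function mapping a tuple of loci to a P-value
    Returns (results, best), where results lists (subset, P-value) pairs
    for the top 1, top 2, ... loci and best is the pair with the smallest P.
    """
    ranked = sorted(z_scores, key=z_scores.get, reverse=True)
    results = []
    for size in range(1, len(ranked) + 1):
        subset = tuple(ranked[:size])
        results.append((subset, subset_p_value(subset)))
    best = min(results, key=lambda item: item[1])
    return results, best
```

Because the best subset is chosen after inspecting all nested subsets, its P-value is optimistically biased. Obtaining a valid significance level would require repeating the whole search within each permutation.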

Figure 2

P-value estimates based on 100 000 random permutations of the affection status when considering an increasing number of gene loci. The six loci are initially ranked with respect to the strength of the difference in Tajima's D between controls and cases. Then increasing numbers of gene loci from the ranked list AKT1, BDNF, RGS4, NRG1, DRD3 and DTNBP1 are included into the analysis, denoted in the figure by the number of included loci.

Discussion

We investigated nucleotide diversity of noncoding fragments around the transcription start site of six schizophrenia-associated gene loci. An excess of rare variants was observed in the sample of schizophrenia patients as compared to a control sample. This observation suggests that rare variants in the sequenced fragments could contribute to the etiology of schizophrenia. The location of the fragments in gene control regions is consistent with an important role for regulatory mutations in the genetics of schizophrenia. However, we are aware that the observed differences between cases and controls are only borderline significant. It would therefore be interesting to see whether the observed differences are part of a more general pattern. Thus, larger case–control sequencing studies at the six investigated gene loci as well as at further candidate loci appear to be a worthwhile endeavor to understand the genetics of schizophrenia and related phenotypes.

The presented use of a summary statistic, such as the difference of Tajima's D between the control and case sample, might provide an important step forward in the analysis of case–control sequencing experiments. The significance of the observed difference in Tajima's D is tested by a permutation-based strategy, quantifying the strength of the difference in the allele frequency spectrum between the two samples. Although the allele frequency spectrum significantly differs between the samples across all gene loci, not all loci contribute equally to this signal. Consequently, a heuristic search efficiently identifies a subset of loci, which produces a stronger signal than any single locus or the full list of all gene loci.

The choice of investigated gene loci was based on earlier reports of common schizophrenia susceptibility haplotypes. Although some of these earlier reports of common susceptibility haplotypes may turn out to be false positives, rare variants at those loci could still play a role in the etiology of schizophrenia. This idea is supported by the fact that all six genes are good functional candidates, based on the current understanding of the molecular biology of schizophrenia. Nevertheless, a weaker signal at a gene locus could also indicate a lower impact of deleterious sequence variation at this locus on the schizophrenia phenotype, here selected for patients with positive family history.

An important role for purifying selection in the origin of conserved noncoding regions was proposed several years ago.41 More recent studies showed a significantly lower allele frequency for human SNPs from conserved noncoding regions,3, 42 consistent with the ongoing action of purifying selection in the human lineage. In the present study, we again observed an excess of rare variants in genomic regions that are conserved between human and rodents, as compared to nonconserved regions. This excess of rare variants in conserved regions applies both to the case and the control sample, pointing to the occurrence of deleterious mutations also in the latter. On the other hand, the patient sample displays more rare variants both within conserved and within nonconserved regions. This indicates an increased number of deleterious mutations, which could be due to an increased number of schizophrenia susceptibility mutations.

In conclusion, our investigation of noncoding sequence variability at six schizophrenia-associated gene loci found rare variants more common among patients than among healthy controls. In addition, rare variants were found concentrated within regions that are conserved between human and rodents. These observations are consistent with the existence of deleterious noncoding mutations in the general population, which are enriched among affected individuals and within conserved regions. The overall excess of rare variants in the case sample primarily resulted from differences at the loci AKT1, BDNF and RGS4. The described strategy might provide a useful framework for further medical sequencing studies in the analysis of genetically complex traits.