Introduction

The nine-amino-acid peptide oxytocin is both a hormone and neurotransmitter in mammals, including humans. Its peripheral actions include stimulating cervical dilation and uterine contraction during childbirth, milk release in lactating mothers, and orgasm during sexual activity. In the brain, it influences cognitive, emotional, and social functions, including maternal and sexual behaviors (Caldwell and Young 2006). Particularly striking illustrations of these are the findings that intranasal oxytocin administration increases “mind-reading” ability (Domes et al. 2007) and trust, the latter by specifically increasing an individual’s willingness to accept social risks arising through interpersonal interactions (Kosfeld et al. 2005). Such characteristics might have been particularly important during the recent evolution of large complex human societies, so oxytocin regulation might have undergone selection during recent evolution.

In addition, oxytocin has been linked to autism (OMIM 209850) (Carter 2007). This complex phenotype is characterized by poor verbal communication, decreased social interaction, and stereotyped patterns of behavior. It shows high heritability but a complex etiology that is poorly understood, although it includes de novo mutations in a small proportion of autism patients. Among these was a deletion that removed a 1.1 Mb region containing 27 genes from chromosome 22 (Sebat et al. 2007), including that coding for oxytocin, OXT. This was of particular interest because of oxytocin’s influence on social interactions noted above (Kosfeld et al. 2005; Domes et al. 2007), because oxytocin plasma levels are decreased in autistic subjects (Modahl et al. 1998), and because intravenous oxytocin has been reported to improve some autistic symptoms, such as the retention of social information (Hollander et al. 2007). If oxytocin level is relevant to autism, could genetic variants that decrease expression increase the chances of developing autism in nondeleted patients when combined with other genetic or environmental factors?

Oxytocin is synthesized as a precursor that also contains a signal peptide and the protein neurophysin I, which are together encoded by the OXT gene. The role of neurophysin I is unclear, but it may be important for protection or packaging of oxytocin (Caldwell and Young 2006). We undertook a study of the genetic variation in and around the OXT gene in healthy individuals from four populations to discover the full spectrum of variation and assess the selective pressures that have shaped the region during recent human evolution.

Materials and methods

DNA samples

Human DNA samples were from the HapMap populations (The International HapMap Consortium 2005): 23 Yoruba from Ibadan, Nigeria (YRI), 23 Chinese Han from Beijing (CHB), 22 CEPH Utah, USA, residents with ancestry from northern and western Europe (CEU), and 23 Luhya from Webuye, Kenya (LWK). DNAs were purchased from the Coriell Institute for Medical Research (Camden, NJ, USA). In addition, one chimpanzee (Pan troglodytes) sample from the European Collection of Cell Cultures (ECACC) (Salisbury, Wiltshire, UK) was included as an outgroup.

Resequencing and detection of variants

A 2.2-kb region [Chr20: 2999590–3001789, National Center for Biotechnology Information (NCBI) build 36] was amplified from genomic DNA with the primer pair 1F and 4R (Table 1). The region is highly GC rich, and successful amplification required the addition of dimethyl sulfoxide (DMSO) to the reaction. The 25-μl polymerase chain reaction (PCR) mixture contained 1 × Taq buffer (Invitrogen), 200 μM deoxyribonucleotide triphosphate (dNTP), 2 mM magnesium sulfate (MgSO4), 0.4 μM 1F primer, 0.4 μM 4R primer, 1.5 U of Platinum High Fidelity Taq DNA polymerase (Invitrogen), 5% DMSO, and 200 ng template DNA. The reaction was initiated by incubating at 94°C for 15 min, followed by 35 cycles of 94°C for 30 s, 55°C for 30 s, and 68°C for 3 min, and included a final extension at 68°C for 7 min 30 s. Then, 10 μl PCR product was purified using exonuclease I (0.067 U) and shrimp alkaline phosphatase (0.67 U) for 1 h at 37°C, followed by 15 min at 85°C to denature the enzymes, and then sequenced by the Faculty Small Sequencing Projects Group at The Wellcome Trust Sanger Institute with all four pairs of primers (Table 1) to produce overlapping reads covering both strands.

Table 1 Primer sequences, polymerase chain reaction (PCR) product sizes and overlap region sizes

All potentially polymorphic positions were flagged by the Mutation Surveyor v. 2.0 software (SoftGenetics, State College, PA, USA) and checked manually. Variable positions were compared in overlapping and complementary reads in all individuals, so that most variants were confirmed on both strands in multiple reads. In addition, four blind replicate samples were included in the experiment and showed complete concordance. As a quality check against external data, genotype calls were compared with HapMap calls for the three single nucleotide polymorphisms (SNPs) from this region typed by HapMap (rs2740210, rs2770378, rs2740208) in the 68 common samples (The International HapMap Consortium 2005). Six discrepancies were found (Electronic supplementary material Table 1), mostly corresponding to heterozygote calls in one data set called as homozygotes in the other (overall 1.7% allele difference). Our calls at these positions appeared clear and were used in our analyses. All novel SNPs were confirmed by genotyping using a SNaPshot primer extension protocol (Applied Biosystems; primers in ESM Table 2, results in ESM Fig. 1).
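The allele-level comparison between two genotype data sets described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function names and the toy genotype calls are hypothetical, and genotypes are assumed to be unphased two-character strings with no missing data. Under that same no-missing-data assumption, 68 samples typed at three SNPs would give 408 allele comparisons.

```python
def allele_differences(call_a, call_b):
    """Count alleles that differ between two unphased diploid genotype calls.

    Genotypes are two-character strings such as "AG"; allele order is ignored.
    A heterozygote versus homozygote mismatch counts as 1, and opposite
    homozygotes count as 2.
    """
    remaining = list(call_b)
    differences = 0
    for allele in call_a:
        if allele in remaining:
            remaining.remove(allele)  # matched allele, consume it
        else:
            differences += 1
    return differences


def allele_discordance(calls_a, calls_b):
    """Fraction of alleles that differ between two lists of genotype calls."""
    total_alleles = 2 * len(calls_a)
    differing = sum(allele_differences(a, b) for a, b in zip(calls_a, calls_b))
    return differing / total_alleles


# Hypothetical toy data: four genotype calls from each of two data sets.
resequencing_calls = ["AG", "AA", "GG", "AG"]
reference_calls = ["AG", "AG", "GG", "AA"]
rate = allele_discordance(resequencing_calls, reference_calls)
```

In the toy data, two of the four genotype pairs disagree, each by a single allele, so the allele discordance is lower than the genotype discordance, which is why the two measures should not be confused.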

Statistical analyses

Derived alleles were identified from the chimpanzee and macaque sequences (The Chimpanzee Sequencing and Analysis Consortium 2005; Gibbs et al. 2007), and derived allele frequencies were determined by direct counting in the four populations. The following statistics were calculated using DnaSP 4.0 (Rozas et al. 2003): (1) Ka/Ks, the ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks); (2) nucleotide diversity; (3) sequence divergence from chimpanzee; (4) allele frequency spectrum summary statistics: Tajima’s D (Tajima 1989), Fu and Li’s D and F (Fu and Li 1993), and Fay and Wu’s H (Fay and Wu 2000); and (5) coalescent simulations to evaluate the significance of the results from (4) under a simple neutral model. Statistics (2) and (3) were also calculated using a sliding window with window length 200 bp and step size 100 bp, and (2), (4), and (5) were calculated for the whole region and for the OXT gene alone. In addition, (6) the Hudson-Kreitman-Aguadé (HKA) test (Hudson et al. 1987) was used to compare divergence and diversity at OXT with corresponding values from a neutral 498 kb ENCODE region ENr321 (Chr8:118882220-119382220) from chromosome 8q24 (Birney et al. 2007). ENCODE data were only available from the YRI, CHB, and CEU populations, and so the LWK data were omitted from this analysis. As no SNPs were found in the OXT coding region, we omitted the coding region of the single gene annotated in ENr321 from this analysis, but similar results were obtained if it was retained. The chimpanzee–human alignment needed for this test was downloaded from Ensembl (http://www.ensembl.org/Homo_sapiens/alignsliceview?c=8:119132220.5;w=500000;align=opt_align_228) and the divergence calculated by DnaSP. The population differentiation at individual sites was measured using FST (Schneider et al. 2000). Linkage disequilibrium (LD) was investigated using Haploview 3.32 (Barrett et al. 2005).
Haplotypes were inferred from the genotype data using PHASE 2.1 (Stephens et al. 2001), and a median-joining network was constructed from the resulting haplotypes using NETWORK 4.1.1.2 (Bandelt et al. 1999) (http://www.fluxus-engineering.com/sharenet.htm).
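To make the allele frequency spectrum statistics more concrete, the sketch below computes Tajima's D from a set of aligned haplotypes using the standard formulas of Tajima (1989). It is an illustration only and not the DnaSP implementation: the function name and the four-haplotype toy alignment are hypothetical, and the code assumes at least two equal-length haplotypes with no gaps or missing data.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Return (pi, S, D) for a list of equal-length aligned haplotype strings.

    pi is the mean number of pairwise differences, S the number of
    segregating sites, and D is Tajima's D (None if S is 0).
    """
    n = len(haplotypes)
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0
    for column in zip(*haplotypes):
        counts = {}
        for base in column:
            counts[base] = counts.get(base, 0) + 1
        if len(counts) < 2:
            continue  # monomorphic site
        seg_sites += 1
        identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        pi += (n_pairs - identical_pairs) / n_pairs
    if seg_sites == 0:
        return pi, seg_sites, None

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    d = (pi - seg_sites / a1) / sqrt(variance)
    return pi, seg_sites, d


# Hypothetical toy alignment: four haplotypes, three segregating sites.
toy_haplotypes = ["AAAA", "AAAT", "AATT", "ATTT"]
pi, s, d = tajimas_d(toy_haplotypes)
```

D compares two estimates of the population mutation parameter, one based on pairwise differences and one based on the number of segregating sites. Values near zero are expected under neutrality, whereas an excess of rare variants drives D negative and an excess of intermediate-frequency variants drives it positive.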

Results

We first evaluated the evolutionary pressures acting on the OXT coding region over the last few million years by examining the Ka/Ks ratio calculated from the human, chimpanzee (The Chimpanzee Sequencing and Analysis Consortium 2005), and rhesus macaque (Gibbs et al. 2007) reference sequences. The ratio was 0.51, a value that will be dominated by the neurophysin I sequence because of its length. This is higher than the genomewide average of 0.25 (Gibbs et al. 2007) but lower than the neutral value of 1. It therefore indicates constrained evolutionary change, although with less constraint than the average protein, and supports the idea that the amino acid sequence of neurophysin I is functionally important.

We then determined the sequence of a 2.1-kb region encompassing the gene (Chr20: 2999650-3001750, Fig. 1) in 91 individuals from four populations and one chimpanzee. We found 14 variable positions in humans, all of which were base substitutions and lay in the flanking or intronic regions; none were found in the coding region (Table 2; ESM Table 3). Six of the SNPs were present in the Single Nucleotide Polymorphism Database (dbSNP), but the other eight were novel, and all of these were confirmed by genotyping (ESM Fig. 1). As expected, the novel SNPs were lower in frequency than the known ones but included two present at >10% frequency in individual population samples. Six dbSNP entries were not found in our sample: one of these was rare, and the others have not been reported in population surveys and so may represent either other rare variants or erroneous database records (Table 2). The average nucleotide diversity for the region was 8.6 × 10⁻⁴ and was slightly lower for the OXT gene alone (4.5 × 10⁻⁴) (Table 3), but both these values were within the normal range (e.g., Akey et al. 2004). Strikingly, both nucleotide diversity and divergence from chimpanzee varied across the region, being low upstream of the gene and in the exons but higher in intron 1 and downstream of the gene (Fig. 1). On this time scale, the 5′ region was almost as constrained as the coding region.
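The way nucleotide diversity is profiled along a region can be sketched with a simple sliding-window calculation. The window length of 200 bp and step of 100 bp follow the values stated in the text; everything else (function names, the toy alignment, and the assumption of gap-free aligned sequences of equal length) is hypothetical and is not a reproduction of the DnaSP calculation.

```python
def window_diversity(seqs, start, end):
    """Per-site nucleotide diversity for alignment columns start..end-1."""
    n = len(seqs)
    n_pairs = n * (n - 1) / 2
    differing_pairs = 0.0
    for pos in range(start, end):
        counts = {}
        for seq in seqs:
            counts[seq[pos]] = counts.get(seq[pos], 0) + 1
        identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        differing_pairs += n_pairs - identical_pairs
    # Mean pairwise differences, divided by the number of sites in the window.
    return differing_pairs / n_pairs / (end - start)


def sliding_diversity(seqs, window=200, step=100):
    """Return a list of (start, end, diversity) tuples along the alignment."""
    length = len(seqs[0])
    return [
        (start, start + window, window_diversity(seqs, start, start + window))
        for start in range(0, length - window + 1, step)
    ]


# Hypothetical toy alignment: 400 bp, one singleton variant at position 250.
reference = "A" * 400
variant = "A" * 250 + "G" + "A" * 149
profile = sliding_diversity([reference, reference, reference, variant])
```

Because the windows overlap by half their length, a single variant contributes to two adjacent windows, which smooths the profile at the cost of making neighbouring values non-independent.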

Fig. 1 Oxytocin/neurophysin I (OXT) gene structure and variation. a Location of regions used for analysis, exons, and coding regions, and single nucleotide polymorphisms (SNPs) discovered. b Conservation, illustrated by the “Mammal Cons” and “Platypus” tracks from the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTracks?org=human). c Distribution of nucleotide diversity within humans, and d divergence between human and chimpanzee along the sequence using a sliding window with length 200 bp and step size 100 bp

Table 2 Characteristics of oxytocin/neurophysin I (OXT) variants in four HapMap populations
Table 3 Oxytocin/neurophysin I (OXT) gene summary statistics

We next investigated whether the patterns of variation within the four populations were consistent with a neutral evolutionary history or indicated the action of positive or balancing selection. The allele frequency spectrum statistics examined (Tajima’s D, Fu and Li’s D, Fu and Li’s F, and Fay and Wu’s H) can show unusually low values in response to positive selection or high values in response to balancing selection but were all consistent with neutrality, both in individual populations and in the worldwide sample (Table 3). The degree of population differentiation (FST value), which can be high if positive selection has acted differentially on populations, was not unusually high, ranging from 0 to 0.17 (ESM Table 4). Eighteen inferred haplotypes were present, and LD was extensive (D′ = 1) through most of the region (ESM Fig. 2). A median-joining network linking the haplotypes was therefore constructed and showed a compact structure with limited reticulation, as expected from the LD pattern (Fig. 2). The three most common haplotypes were all present in all populations, consistent with the lack of population differentiation shown by individual SNPs.
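The LD measure D′ reported above can be illustrated with a short calculation from two-locus haplotype counts. This is a sketch of the standard definition, not of Haploview's implementation; the function name, the allele labels (A/a at the first SNP, B/b at the second), and the counts are all hypothetical.

```python
def d_prime(haplotype_counts):
    """Return |D'| for two biallelic loci.

    haplotype_counts maps the two-locus haplotypes "AB", "Ab", "aB", "ab"
    to their observed counts; missing keys are treated as zero.
    """
    total = sum(haplotype_counts.values())
    p_ab = haplotype_counts.get("AB", 0) / total
    p_a = (haplotype_counts.get("AB", 0) + haplotype_counts.get("Ab", 0)) / total
    p_b = (haplotype_counts.get("AB", 0) + haplotype_counts.get("aB", 0)) / total
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")

    d = p_ab - p_a * p_b
    # Normalise by the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max


# Hypothetical counts: only three of the four possible haplotypes occur.
complete_ld = d_prime({"AB": 50, "Ab": 0, "aB": 30, "ab": 20})
# Hypothetical counts: all four haplotypes occur.
partial_ld = d_prime({"AB": 40, "Ab": 10, "aB": 10, "ab": 40})
```

D′ reaches 1 whenever at least one of the four possible two-locus haplotypes is absent, which is the pattern expected when no recombination or recurrent mutation has occurred between the two sites, and it is in such regions that a haplotype network shows little reticulation.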

Fig. 2 Median-joining network of oxytocin/neurophysin I (OXT) haplotypes. Circles represent haplotypes and lines the mutational steps that distinguish them. Numbers on the lines refer to single nucleotide polymorphism (SNP) positions in Table 2. Circle area is proportional to haplotype frequency, and circles are color coded according to population

Finally, we assessed whether the level of variation in humans was as expected from the amount of divergence from chimpanzee, as it would be under a strictly neutral model of evolution. For this, we used an HKA test in which OXT was compared with other neutral resequenced regions. In order to provide power, we used 500 kb resequenced by the ENCODE project (Birney et al. 2007) in three of the same populations. More variation was found at OXT than expected (P = 0.024; Table 4), and in view of the large comparative region analyzed, this is a robust result.

Table 4 Results of the Hudson-Kreitman-Aguadé (HKA) test

Discussion

We carried out a thorough investigation of the genetic variation in and around the OXT gene. We were particularly interested in two aspects of this: whether or not variants likely to influence function were present, and whether or not the region showed a neutral evolutionary history.

Functional variants are located most obviously in the protein-coding regions of a gene, but evolutionary analyses show that many noncoding sequences are conserved to the same extent as coding sequences and thus are also likely to be functionally important. No OXT coding variants were found, but two novel low-frequency SNPs were discovered in the 5′ flanking region in African populations, one of which (at 3,000,148) lies in the most highly conserved segment outside the exons (Fig. 1b). Comparison with a species such as platypus shows that the downstream region is overall as conserved as the upstream (44% downstream, 46% upstream; Table 5), although this conservation is broadly distributed and lacks highly conserved segments (Fig. 1b). It is therefore possible that SNPs in one or both regions might contribute to variation in the expression of the gene.

Table 5 Comparison of human and platypus sequence near the oxytocin/neurophysin I (OXT) gene

Most of the evolutionary tests suggested a neutral evolutionary history of the region, ruling out strong recent positive or balancing selection. The HKA test, however, revealed a larger number of variants than expected when the evolutionary divergence rate was taken into account. This finding is in one respect conservative, because the evolutionary divergence may be overestimated when a single chimpanzee is used for comparison. The result can be viewed in two ways. It could be argued that, as we applied five tests (Tables 3, 4), a strict Bonferroni correction would require a P value of 0.01 for significance, and thus our observed P value of 0.024 is not significant; that is, the excess variants are within the range expected by chance. However, a Bonferroni correction is probably too conservative in this case, because the tests in Table 3 all examine related aspects of the allele frequency spectrum and so are not independent. According to this view, the excess of variants is above that expected by chance and, to speculate further, increased diversity of 3′ regulatory elements might have been selected for, perhaps leading to a range of OXT expression levels in the population.
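The multiple-testing argument can be made explicit with a few lines of arithmetic. The significance level of 0.05, the five tests, and the observed P value of 0.024 come from the text; the function names are hypothetical, and the second scenario, which treats the four correlated frequency-spectrum tests as a single test so that only two effectively independent tests remain, is purely an illustrative assumption and not a calculation made in the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True if p_value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


ALPHA = 0.05
HKA_P_VALUE = 0.024

# Strict view: all five tests counted as independent.
strict_threshold = bonferroni_threshold(ALPHA, 5)
strict_result = is_significant(HKA_P_VALUE, ALPHA, 5)

# Illustrative assumption only: the four frequency-spectrum tests
# are counted as one, leaving two effective tests.
relaxed_threshold = bonferroni_threshold(ALPHA, 2)
relaxed_result = is_significant(HKA_P_VALUE, ALPHA, 2)
```

The contrast between the two scenarios shows why the conclusion depends on how many of the tests are regarded as independent, which is the point at issue in the paragraph above.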

It is now possible to design in vitro experiments to study the regulation of OXT expression and association studies to relate its variation to phenotypes of interest, such as plasma oxytocin level or autism. It is striking that LD in the 100-kb region surrounding OXT is very low in all HapMap populations and is even incomplete for position 3,001,514 within the short region resequenced, so tagging would be inefficient and all SNPs need to be identified and genotyped directly in association studies. The new SNPs discovered in this study provide additional material for such projects.