Introduction

Specific language impairment (SLI) is a profound deficit in the acquisition of language despite adequate intelligence and opportunity, in the absence of any possible medical aetiology.1 This disorder is a common developmental condition affecting between 5% and 8% of pre-school children, and thus places a heavy burden upon health-related and educational services.2 It is well documented that SLI has a strong genetic basis (reviewed by Stromswold3). However, it is proposed that susceptibility to this disorder is complex in nature, involving multiple genes in combination with environmental factors.4 The genetic basis of complex disorders is notoriously difficult to characterise, as the contributing factors can vary greatly between affected individuals and may be masked by undetermined environmental effects. This is reflected in the fact that, to date, only four genetic loci5, 6, 7 and three associated candidate genes8, 9 have been described for SLI (OMIM no. 606711 (SLI1), OMIM no. 606712 (SLI2), OMIM no. 607134 (SLI3), OMIM no. 612514 (SLI4), OMIM no. 612514 (CNTNAP2, SLI4), OMIM no. 613082 (ATP2C2, in SLI1) and OMIM no. 610112 (CMIP in SLI1)).

Isolated founder populations can provide an important resource in the identification of causal genes underlying complex disorders.10 Such populations are derived from a small number of relatively recent ancestors and thus are relatively homogeneous, a point which can greatly assist gene mapping processes.11 Furthermore, one may postulate that loci identified in founder populations may hold more relevance to the general population than those yielded by the study of rare monogenic forms of impairment. In 2008, Villanueva et al12 described a Chilean founder population with an increased incidence of SLI (known as TEL in Spanish-speaking countries). This population inhabits Robinson Crusoe Island, which forms part of the Juan Fernandez Archipelago, 677 km to the west of Chile, South America. Robinson Crusoe Island is the only inhabited island in the archipelago and has 633 residents. The most recent colonisation dates to the late nineteenth century when the island was repopulated by a group of eight families. A total of 77% of the current population has at least one of the colonising surnames, supporting a high degree of consanguinity. Linguistic profiling of the colonising children indicated that 35% met current criteria for SLI (expressive or comprehensive language >2SD below that expected for their age), 27.5% had language deficits associated with other pathologies (eg, delayed psychomotor development, intellectual deficit or auditory impairment) and 37.5% displayed normal language skills.12 In contrast, the frequency of SLI in the non-colonising children (3.8%) coincided with that reported for mainland Chile (4%).13 Genealogical reconstruction indicated that 75% of known affected individuals were descended from a single pair of founder brothers.12 This population therefore represents a rare resource, which may be valuable in the identification of genetic loci contributing to susceptibility to SLI.

In this study, we perform genome-wide loss of heterozygosity mapping and parametric and non-parametric linkage analysis of the Robinson Crusoe population. We identify five regions (on chromosomes 6, 7, 12, 13 and 17) that meet genome-wide significance, and several loci, which are consistently implicated across alternative analyses. We hypothesise that these regions may contain variants that underlie the high prevalence of SLI observed in this isolated population.

Subjects and methods

This work was approved by the ethics department of the University of Chile. Informed consent was given by all participants and/or, where applicable, their parents.

DNA was extracted from EDTA whole blood samples collected from all available SLI and language-normal probands and their immediate families (125 individuals from 34 families, Table 1) using a standard chloroform extraction protocol.

Table 1 Sample structure

All Island inhabitants between 3 and 8 years, 11 months of age (n=66) were subjected to a linguistic battery, which included tests of phonology (Test para Evaluar Procesos de Simplificación Fonológica (TEPROSIF)14) and expressive and receptive morphosyntax (Toronto Spanish Grammar Exploratory test15). Any child who performed >2SD below that expected for their age was classified as having SLI. Exclusion criteria included non-verbal IQ (Columbia Mental Maturity Scale) below the 80th percentile, hearing disability, motor or structural abnormalities (Oral Motor and Speech Examination16) and a co-morbid diagnosis of autism, emotional difficulties, or neurological disorders (as assessed by medical history). Following proband ascertainment, available family members were assessed for the presence of SLI. Individuals who fell outside the age-range of available standardised tests (3 and 8 years 11 months) were assessed through a family history interview17 and tests of verbal fluency (Barcelona test18), verbal comprehension (Token test19), non-verbal intelligence (Raven's progressive matrices20) and auditory screening. The identification and classification of probands formed part of the descriptive study by Villanueva et al (2008). As this previous manuscript was in Spanish, detailed assessment descriptions are provided as Supplementary Material.

The present study considers only families derived from colonising families of the Robinson Crusoe Island (ie, at least one ancestor related to a founder member).

Genotyping

DNA was quantified by a pico-green assay (Quant-iT, http://www.invitrogen.com). In total, 125 samples were genotyped on the Illumina HumanLinkage-12 panel following the multi-sample Infinium II assay (http://www.illumina.com). These beadchips allow the genotyping of 6090 genome-wide single nucleotide polymorphisms (SNPs) and simultaneously analyse 12 DNA samples.

Quality control procedures

All genotypes were called within Beadstudio (Version 3, Illumina Inc., San Diego, CA, USA). Any SNP with a gentrain score below 0.9 was manually inspected and, if necessary, the clusters adjusted. A total of 18 samples were duplicated across arrays. Any SNP with a gentrain score below 0.5 (n=27), a call rate below 0.97 (n=4) or a minor allele frequency below 2.5% (n=2) was excluded from further analyses.
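As an illustration only, the three exclusion thresholds can be written as a per-SNP filter. This is a minimal sketch, not the Beadstudio workflow: the `SnpQC` record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpQC:
    """Hypothetical per-SNP quality summary (field names are assumptions)."""
    name: str
    gentrain: float   # cluster-quality score
    call_rate: float  # fraction of samples with a called genotype
    maf: float        # minor allele frequency


def passes_qc(snp, min_gentrain=0.5, min_call_rate=0.97, min_maf=0.025):
    """Return True if the SNP survives all three exclusion thresholds."""
    return (
        snp.gentrain >= min_gentrain
        and snp.call_rate >= min_call_rate
        and snp.maf >= min_maf
    )


# Made-up example records, one failing each criterion in turn.
snps = [
    SnpQC("snp_a", gentrain=0.92, call_rate=0.999, maf=0.31),
    SnpQC("snp_b", gentrain=0.45, call_rate=0.999, maf=0.20),  # low gentrain
    SnpQC("snp_c", gentrain=0.88, call_rate=0.95, maf=0.40),   # low call rate
    SnpQC("snp_d", gentrain=0.90, call_rate=0.99, maf=0.01),   # low MAF
]
kept = [s.name for s in snps if passes_qc(s)]
```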

All called genotypes were subjected to a haplotypic error detection algorithm in MERLIN.21 All identified unlikely genotypes (P<0.001) were re-examined and, if necessary, excluded. Probabilities of Hardy–Weinberg Equilibrium (HWE) were calculated within PEDSTATS22 and any SNP with a HWE-p <0.001 (2 of 5666 SNPs examined) was identified for cautious treatment in the remaining analyses.
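To make the Hardy–Weinberg check concrete, the sketch below computes a P-value from genotype counts using a one-degree-of-freedom chi-square goodness-of-fit test. The choice of test is an assumption made for illustration; PEDSTATS may use a different procedure (for example an exact test), and the function name and counts are hypothetical.

```python
import math


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Chi-square (1 df) P-value for departure from Hardy-Weinberg proportions.

    Assumption: a simple goodness-of-fit test, not necessarily the test
    implemented in PEDSTATS.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts: a SNP in perfect equilibrium and one with no heterozygotes.
p_balanced = hwe_chi2_p(25, 50, 25)
p_no_hets = hwe_chi2_p(50, 0, 50)
flagged = p_no_hets < 0.001  # would be marked for cautious treatment
```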

Allele-sharing between individuals was examined using the Graphical Representation of Relationships (GRR).23 This software calculates mean Identity by State (IBS) values for all possible pairs of samples and clusters individuals accordingly. Any individual found to cluster outside the expected IBS values was further examined. This error checking stage identified two DNA samples that had been mislabelled and were therefore excluded.
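The mean IBS statistic for a pair of samples can be sketched as follows. This is an illustrative calculation rather than GRR's implementation: genotypes are assumed to be coded as the count (0, 1 or 2) of one allele, with `None` for a missing call, and the function name is hypothetical.

```python
def mean_ibs(genotypes_1, genotypes_2):
    """Mean number of alleles shared identical by state across SNPs.

    Assumed coding: 0, 1 or 2 copies of a reference allele; None = missing.
    SNPs with a missing call in either sample are skipped.
    """
    shared = [
        2 - abs(a - b)
        for a, b in zip(genotypes_1, genotypes_2)
        if a is not None and b is not None
    ]
    if not shared:
        raise ValueError("no SNPs with calls in both samples")
    return sum(shared) / len(shared)


# Made-up genotype vectors: a duplicated sample shares 2 alleles at every SNP.
duplicate_pair = mean_ibs([0, 1, 2, 2], [0, 1, 2, 2])
other_pair = mean_ibs([0, 0, 2, None], [2, 1, 2, 1])
```

Under this coding, a mislabelled sample would be expected to show a mean IBS with its nominal relatives that is lower than the value typical of that relationship class.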

Generation of linkage pedigrees

Genealogical information was collated from birth and marriage certificates, family names and parent and relative interviews. Known relationships between identified nuclear families and the relevant pair of founder brothers were reconstructed and examined within the Progeny software (www.progenygenetics.com) (Figure 1).

Figure 1

Descendants of founder brothers. The majority of affected individuals were found to be descended from a single pair of founder brothers.

Homozygosity mapping

Genotype data from all affected individuals were analysed for loss of heterozygosity within PLINK.24 Sliding windows of 20-SNP genotypes were examined for runs of homozygosity. In all, 42 affected individuals from 23 nuclear families were examined including 2 affected sib-pairs, an affected trio of siblings and 3 affected half-sib-pairs. Previous studies have found that runs of homozygosity <4 Mb are common in outbred individuals.25 Segments were therefore defined as homozygous tracts if 10 homozygous SNPs were found to extend across a region greater than 4 Mb in size.
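The tract definition (at least 10 homozygous SNPs spanning more than 4 Mb) can be illustrated with a simplified scan over one individual's genotypes on one chromosome. This sketch is not the PLINK sliding-window algorithm, which additionally tolerates a limited number of heterozygous or missing calls; the function name and input layout are assumptions.

```python
def homozygous_tracts(positions, is_homozygous, min_snps=10, min_span_bp=4_000_000):
    """Return (start_bp, end_bp, n_snps) for each qualifying homozygous run.

    positions: sorted base-pair positions of the SNPs on one chromosome.
    is_homozygous: parallel list of booleans for one individual.
    Simplification: a single heterozygous call ends the run.
    """
    tracts = []
    start = None
    last = len(is_homozygous) - 1
    for i, hom in enumerate(is_homozygous):
        if hom and start is None:
            start = i
        if start is not None and (not hom or i == last):
            end = i if hom else i - 1
            n_snps = end - start + 1
            span = positions[end] - positions[start]
            if n_snps >= min_snps and span > min_span_bp:
                tracts.append((positions[start], positions[end], n_snps))
            start = None
    return tracts


# Hypothetical chromosome with one SNP every 500 kb.
positions = [i * 500_000 for i in range(30)]
calls = [False] * 5 + [True] * 12 + [False] * 3 + [True] * 6 + [False] * 4
tracts = homozygous_tracts(positions, calls)
```

In this made-up example only the 12-SNP run qualifies; the 6-SNP run fails both the SNP-count and the length criterion.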

Linkage analyses

Genotype data were stored in the Integrated Genotyping System (IGS) database. Individuals were classified as affected or unaffected on the basis of linguistic testing as described in ‘subjects’. Data were analysed for linkage within MERLIN (autosomes) and the MERLIN extension, MINX (X chromosome).21 As linkage packages were unable to analyse genome-wide data for the 242-bit pedigree as a whole, it was broken into sub-pedigrees. This segmentation was manually performed on the basis of closest shared ancestor. Seven extended families of 20–24 bits (where a bit is defined as 2 × the number of non-founders−the number of founders) were defined and included 41 affected individuals and 63 of the 123 genotyped individuals (Figure 2). Although some individuals were present across multiple sub-pedigrees, all affected-relative pairs were only represented once. Genotype data for unaffected individuals were used for haplotype analyses (described below).
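The bit-size definition given above (2 × non-founders − founders) can be checked on a small example. The sketch below assumes a hypothetical pedigree representation in which each individual maps either to `None` (a founder) or to a (father, mother) pair; the 24-bit limit is taken from the maximum sub-pedigree size reported here.

```python
def pedigree_bits(pedigree):
    """Bit size = 2 x (number of non-founders) - (number of founders).

    pedigree: dict mapping individual ID -> None for founders,
    or a (father_id, mother_id) tuple for non-founders (assumed layout).
    """
    founders = sum(1 for parents in pedigree.values() if parents is None)
    non_founders = len(pedigree) - founders
    return 2 * non_founders - founders


# Made-up nuclear family: two founders and three children.
nuclear_family = {
    "father": None,
    "mother": None,
    "child1": ("father", "mother"),
    "child2": ("father", "mother"),
    "child3": ("father", "mother"),
}
bits = pedigree_bits(nuclear_family)
fits_limit = bits <= 24  # largest sub-pedigree size used in this study
```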

Figure 2

Small pedigrees used for linkage analyses. The larger pedigree shown in Figure 1 was broken into seven smaller pedigrees of maximum size 24 bits for linkage analyses.

Parametric linkage analyses were performed under dominant and recessive models of linkage assuming a disease frequency of 35% (as described in the Robinson Crusoe population) and full penetrance. As the model of inheritance for SLI is unknown (and not expected to be monogenic in most instances) we also performed non-parametric analyses. Although explicit input parameters are not necessary for the completion of non-parametric analyses, expected allele frequencies must be specified. In this study, because of the isolated nature of the population, we had no directly appropriate control data and therefore performed three non-parametric analyses using alternative allele frequency estimation strategies. First, we used allele frequencies of all genotyped individuals (n=123). These individuals are derived directly from the population under study and can therefore be expected to provide representative expected allele frequencies. Nonetheless, these data are derived from related individuals and can therefore lead to a bias. We therefore repeated the analyses using allele frequency data from genotyped founder individuals of the generated sub-pedigrees (ie, those who marry into the pedigree, n=9). Although this reduces the dependence between individuals, it relies upon a small number of data points. We therefore also performed linkage analysis using allele frequency data from 60 unrelated CEPH individuals. The Y chromosome SNP data of the Robinson Crusoe population indicated that the founder males were European in origin (data not shown). Non-parametric results are reported as NPL scores and threshold levels for genome-wide significance are in line with those suggested by Kruglyak and Lander.26 Namely, NPL scores of >3.18 (P=7.4 × 10−4) are described as suggestive linkage, NPL scores >4.08 (P=2.2 × 10−5) as significant and NPL scores >4.99 (P=3.0 × 10−7) as highly significant. Using a Bonferroni multiple testing correction for the three non-parametric analyses run, these thresholds equate to P=2.46 × 10−4, P=7.3 × 10−6 and P=1.0 × 10−7, respectively. In this instance, we expect the Bonferroni correction to be over-conservative because of the high-expected correlation between the three analyses.
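The relationship between the NPL thresholds, their P-values and the Bonferroni-adjusted levels can be reproduced with a short calculation. The sketch assumes that an NPL score behaves approximately as a standard normal deviate under the null hypothesis and that P-values are one-sided; this is an illustrative assumption, and the function and variable names are hypothetical.

```python
import math


def npl_to_p(z):
    """One-sided upper-tail P-value for a standard normal score z (assumed null)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Nominal genome-wide thresholds quoted in the text.
thresholds = {
    "suggestive": 7.4e-4,
    "significant": 2.2e-5,
    "highly significant": 3.0e-7,
}

# Bonferroni correction for the three non-parametric analyses.
n_tests = 3
bonferroni = {label: p / n_tests for label, p in thresholds.items()}

p_significant = npl_to_p(4.08)         # close to 2.2e-5
p_highly_significant = npl_to_p(4.99)  # close to 3.0e-7
```

Because the three analyses differ only in the allele frequencies supplied, their results are strongly correlated, which is why dividing by three is expected to be conservative.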

Haplotype analyses

Haplotypes were reconstructed for the chromosome 7 region of linkage within nuclear 2-generation families using MERLIN.21 Two-SNP sliding windows were visually inspected for allele combinations that co-segregated with affection status. All haplotypes that were found to have odds ratios of >2.0 or <0.5 (n=5) were analysed for association within PLINK using all genotyped cases and controls under a linear model.24 In these analyses, no correction was made for the relationships between cases and controls. Association analyses of simulated data-sets yielded a distribution of empirical P-values that fit well with those expected under the theoretical model, indicating that, in this particular case, the relationships between individuals do not inflate the significance of the results obtained (data not shown). Measures of linkage disequilibrium (LD) were calculated within Haploview.27
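The odds-ratio screen applied to candidate haplotypes can be sketched from a 2 × 2 table of haplotype counts. The counts below are invented purely for illustration and are not taken from the study; the function names are hypothetical.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio for carrying a haplotype in cases versus controls.

    Note: a zero in case_without or control_with would need special
    handling (e.g. a continuity correction), which is omitted here.
    """
    return (case_with * control_without) / (case_without * control_with)


def is_candidate(or_value, upper=2.0, lower=0.5):
    """Apply the screening rule: keep haplotypes with OR > 2.0 or OR < 0.5."""
    return or_value > upper or or_value < lower


# Made-up chromosome counts for two hypothetical haplotypes.
or_enriched = odds_ratio(60, 24, 40, 48)  # over-represented in cases
or_neutral = odds_ratio(30, 30, 30, 30)   # no difference
selected = [is_candidate(or_enriched), is_candidate(or_neutral)]
```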

Results

Pedigree reconstructions confirmed that of the 44 affected individuals from whom we had DNA, 37 (84%) were descendants of a pair of founder brothers (Figure 1), 3 (7%) were not related to the founder brothers and 4 (9%) had unknown ancestry. Following quality control, genotypes were available for 6009 SNPs (5666 autosomal) with an average spacing of one SNP every 490 kb. The average genotype call rate was 99.9%. The minimum SNP genotype rate was 94.3% and the minimum SNP heterozygosity was 4%. Two individuals (both affected) were excluded from the analyses yielding genotype data for 123 individuals with an average individual genotype rate of 99.9% and a minimum individual genotype rate of 99.2%. The genotype mismatch rate across duplicated samples was 0.0027% and two SNPs were found to have a Hardy–Weinberg P-value of <0.001.

Of the 42 affected individuals examined, 28 showed at least one tract of homozygosity. Across all affected individuals, an average of 13.1 Mb (median, 5.4 kb) of the genome consisted of homozygous tracts. In individuals whose parents were known to be first or second cousins (n=6), this figure increased to 26.3 Mb (median, 28.9 kb). No chromosome region was found to be homozygous in all affected individuals, but two chromosome regions were homozygous in four (10%) affected individuals. These comprised a 4 Mb region of chromosome 11 and a 6 Mb region of chromosome 15, both of which were homozygous in a sib-pair and two additional unrelated individuals (Table 2). In total, 18 chromosome regions contained overlapping segments of homozygosity (Table 2).

Table 2 Homozygous segments shared between more than two affected individuals

No chromosome region reached parametric genome-wide significance (HLOD>3, Figure 3). Maximum HLODs were observed on chromosome 8 for the dominant model (rs1390950, HLOD=2.4, Figure 3) and chromosome 1 for the recessive model (rs1906255, HLOD=1.52, Figure 3). Under the recessive model, chromosome 15 gave HLOD scores marginally above 1 (maximum HLOD=1.05) in a region that was homozygous in four affected individuals (Table 3).

Figure 3

Genome-wide linkage analyses. Traces are shown for parametric analyses using both dominant and recessive models with full penetrance and three non-parametric models utilising expected allele frequencies derived from CEPH population, from genotyped founders in the sub-pedigrees and from all genotyped individuals. Traces are also shown for identified stretches of homozygosity (where the X-axis represents the number of individuals found to be homozygous across the region).

Table 3 Chromosome regions highlighted by linkage analysis

Non-parametric linkage analyses identified five chromosome regions (chromosomes 6q, 7, 12, 13 and 17) that reached the threshold for genome-wide significance (NPL>4.08, P<2.2 × 10−5). Three of these (chromosomes 6q, 7 and 12) were highly significant (NPL>4.99, P<3.0 × 10−7) (Table 3, Figure 3), even after an over-conservative Bonferroni correction for the three non-parametric tests performed. The linkages to chromosomes 6q and 12 were only observed in a single non-parametric analysis whereas those to chromosomes 7, 13 and 17 were consistent across all non-parametric analyses performed (Figure 3, Table 3).

The most consistently implicated region was on chromosome 7. This locus achieved highly significant linkage under all three non-parametric models (max NPL=6.73, P=4.0 × 10−11) and contained a region that gave a HLOD of 1.24 in the recessive parametric linkage analyses and a segment that was found to be homozygous in two affected individuals (Table 3, Figure 4). Linkage analyses within each of the sub-pedigrees revealed that four families were contributing to linkage at this locus (linkage peds 3, 5, 6 and 7 (Figure 2), data not shown). Segregation analyses of two-SNP sliding windows across this region identified five 2-SNP combinations that were present in at least 90% of affected individuals. Further investigations in all genotyped individuals indicated that one of these haplotypes (rs727714/rs969356, AG) occurred at a significantly lower frequency in unaffected individuals than in affected individuals (Supplementary Table 1). The AG genotype of the rs727714/rs969356 haplotype was present in 98% of cases and 76% of controls and had an allele frequency of 67% in cases and 48% in controls (P=0.008). This association remains marginally significant (P=0.04) after the application of a Bonferroni correction for the five haplotypes tested. This haplotype covered 74 kb of sequence and coincided with the non-parametric (All) peak of linkage. It lay 2.5 Mb proximal to the SNPs with the highest NPL score in the two alternative non-parametric analyses (rs1524341 and rs1024676, D′=0.21–0.23, Tables 3 and 4) and was 2.6 Mb distal to the region of parametric linkage and 3 Mb proximal to a segment of homozygosity. Investigation of the LD structure indicated that the rs727714/rs969356 haplotype showed moderate (D′>0.4 and LOD>2) long-range LD with surrounding variants (Table 4), which may provide an alternative explanation for the association observed. One of the two haplotype SNPs (rs727714) falls in exon 3 of the NOBOX gene, creating a synonymous base substitution.
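For readers less familiar with the D′ measure used to describe LD here, the sketch below computes it from allele and haplotype frequencies using the standard normalisation of the disequilibrium coefficient D. It is an illustration rather than the Haploview implementation, which estimates haplotype frequencies from unphased genotypes; the function name and the frequencies used are hypothetical.

```python
def d_prime(p_ab, p_a, p_b):
    """Normalised linkage disequilibrium |D| / D_max between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2.
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


# Made-up frequencies: complete LD, no LD, and an intermediate case.
complete = d_prime(0.5, 0.5, 0.5)
independent = d_prime(0.25, 0.5, 0.5)
intermediate = d_prime(0.3, 0.5, 0.4)
```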

Figure 4

Chromosome 7. Chromosome 7 represented the most consistently linked locus across analyses. Traces are shown for parametric analyses using both dominant and recessive models with full penetrance, and for three non-parametric models utilising expected allele frequencies derived from the CEPH population, from genotyped founders in the sub-pedigrees and from all genotyped individuals. Traces are also shown for identified stretches of homozygosity (where the y axis represents the number of individuals found to be homozygous across the region) and for association P-values (relative to the secondary y axis).

Table 4 SNPs that are in LD with associated haplotype

As expected, given the density of the panel used in this study, single SNP association across the entire region of linkage on chromosome 7 did not identify any significant associations (minP across linkage region=0.02, Figure 4). As single SNPs, rs727714 and rs969356 yielded association P-values of 0.04 and 0.13, respectively.

Discussion

In this paper, we perform genome-wide analyses of an isolated Chilean population affected by Specific Language Impairment (SLI). Homozygosity mapping and parametric linkage analyses did not identify any chromosome segments that co-segregate with SLI in this population, suggesting that a completely penetrant monogenic aetiology is unlikely. This hypothesis is further supported by the observed nature of the language impairments. Affected individuals do not present with a specific core phenotype as may be predicted under a monogenic model, but instead show extensive heterogeneity in the severity and nature of impairment between affected individuals, as is typical of complex genetic forms of SLI.

The most consistent region of linkage extended across 48 Mb of chromosome 7q (chromosome position 111 285 062–158 710 965). This region reached a maximum NPL score of 6.73 (P=4.0 × 10−11) and achieved genome-wide significance in all three non-parametric analyses performed and overlapped with a peak of parametric linkage (recessive model max HLOD=1.24) and two segments of homozygosity. Although these are not independent observations and a number of alternative analyses were performed, the reliability of the linkage in this region is consistent with that expected from a true positive.

Segregation analyses identified a two-SNP haplotype that was found at a marginally higher frequency in cases than in controls (P=0.008). This haplotype fell across the NOBOX (OMIM no. 610934) and TPK1 (OMIM no. 606370) genes. NOBOX is a homeobox gene, which is preferentially expressed in oocytes, but not reported to be expressed in brain.28 TPK1 encodes the thiamine pyrophosphokinase 1 enzyme, which catalyses the conversion of thiamine to thiamine pyrophosphate. Thiamine (or vitamin B1) is essential for the metabolism of carbohydrates into glucose and acts as a co-enzyme in the production of acetylcholine. Thiamine deficiency forms part of numerous disorders including ataxia, confusion and impaired memory.29 Interestingly, a recent study suggested a link between thiamine deficiency and syntactic and lexical disorder.30 The chromosome 7 peak also overlaps with the AUTS1 locus of linkage to autism31 and includes the FOXP2 and CNTNAP2 genes, both of which have previously been associated with language disorders.9, 32 The genotyping panels utilised in this study were optimised for linkage investigations and thus involve a relatively sparse map of SNPs (1 SNP every 500 kb). The fine mapping of these regions is therefore required to enable the identification of candidates in an unbiased manner. We found that the two-SNP haplotype on chromosome 7 showed moderate long-range linkage disequilibrium with a number of SNPs, indicating that further information would be required to narrow the linkage peak. Higher density SNP arrays would also enable the detection of smaller runs of homozygosity.

We did not observe any linkage to chromosomes 16 or 19, which have previously been implicated in SLI.5, 6, 33 Again, this may be caused by the low density of markers investigated in the present study. Alternatively, as the loci on chromosomes 16 and 19 were identified by a quantitative genome screen of language-related measures, this may reflect differences in study design. As the Chilean quantitative linguistic data were collected only for subjects within a restricted age range (3 and 9 years), the current study utilised a binary affection status. This is similar to the approach applied by Bartlett et al (2002, 2004) in their genome screen for SLI in which they identified a region of linkage on chromosome 13 (SLI3), which overlaps with that found by the present study. This region has also been linked to autism,34 a result which was strengthened by the selection of families on the basis of linguistic data.35 Our chromosome 13 linkage consisted of two adjacent peaks. The distal peak (34–48 Mb) overlapped with a segment of homozygosity and achieved a maximum NPL score of 4.8 (P=8.0 × 10−7) using CEPH allele frequencies. The proximal peak (83–94 Mb) reached an NPL of 3.5 (P=0.0002) under all non-parametric analyses performed and coincided with an area of marginal linkage under a recessive parametric model.

In addition to the linkages on chromosomes 7 and 13, we also observed significant linkage (NPL>4.08 (P<2.2 × 10−5)) to chromosome 17 and highly significant linkages (NPL>4.99 (P<3.0 × 10−7)) to chromosomes 6q and 12 (Figure 3, Table 3). However, these peaks were only observed under a single non-parametric model and not in models using alternative expected allele frequencies. It is therefore likely that these divergent results are driven by differences in the allele frequencies of the control populations used, which illustrates the importance of correctly estimating allele frequencies, especially for markers that are in linkage disequilibrium.36 Indeed, we found that the correlation of expected allele frequencies between the three different control groups was moderate (0.41–0.70 across all SNPs) and was lower than average across the conflicting regions of linkage on chromosomes 6 and 12 (as low as 0.29 and 0.09, respectively), but remained moderate across the region of linkage on chromosome 7 (0.48–0.67). Importantly, simulation studies indicate that although allele frequency misspecification can lead to false positives, this artefact is not expected to affect the power to detect true linkages.37 Thus, although the loci on chromosomes 6 and 12 reached a threshold of highly significant linkage, as these were observed with only one non-parametric analysis, we must recognise the possibility that they represent false positives, especially given the high number of tests performed. Instead, a more fruitful avenue of investigation may be provided by the examination of regions found to be consistently implicated across all three analyses performed, even in cases where this linkage did not reach genome-wide significance (eg, chromosomes 2, 6p, 8, 9, 15 and 17; Table 3, Supplementary Figure 1).
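The comparison of expected allele frequencies between reference groups can be illustrated with a simple correlation across the SNPs in a region. The text does not state which correlation coefficient was used, so the Pearson coefficient below is an assumption, and the frequency vectors are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of allele frequencies."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical allele frequencies for the same five SNPs in two reference groups.
freqs_all_genotyped = [0.10, 0.35, 0.50, 0.22, 0.41]
freqs_reference = [0.12, 0.30, 0.55, 0.20, 0.45]
r_region = pearson(freqs_all_genotyped, freqs_reference)
```

A low coefficient across a linked region would indicate that the linkage evidence there depends heavily on which set of expected allele frequencies is supplied.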

In conclusion, this study has applied a genome-wide approach to identify loci which may contain genes underlying susceptibility to SLI in an isolated population. This study represents the first step in the detection of genetic variants that underlie the increased frequency of language impairments in this population. It is envisaged that the fine mapping of the identified loci will allow the detection of associated polymorphisms. It is likely that the variants identified by the further study of this population will have a significant role in furthering our understanding of the genetic basis of language impairments and language development.