Introduction

A set of cell lines developed by the Centre d’Étude du Polymorphisme Humain (CEPH) serves as one of the most widely used resources in cell biology. These lymphoblastoid cell lines were derived from 809 individuals in 62 three-generation pedigrees. In the 1990s, these lines were used extensively for human genome mapping studies1, 2, 3 (reviewed in Prescott et al4). These maps were created at relatively low resolution, with polymorphic markers identified at 5 to 15 centimorgan (cM) distances (5 to 15 Mb). Subsequently, 180 CEPH family samples were used as part of the International HapMap project, consisting of 60 trios that comprise grandfather/grandmother/parent members of three-generation pedigrees.5, 6 The CEPH collection has been utilized for a broad range of other applications such as assessing genetic variation underlying gene expression, studying allelic variation,7 and identifying cis or trans expression quantitative trait loci (eg Morley et al8 and Monks et al9).

Consanguinity is known to occur with varying frequencies among populations due to geographical and cultural factors.10 The offspring of consanguineous parents may have regions of homozygosity due to autozygosity. An individual is autozygous at a chromosomal locus if he or she inherits two copies of a single ancestral allele from consanguineous parents. The risk of recessive disorders is higher within inbred families because of the increased probability for two deleterious alleles that are identical-by-descent (IBD). At the population level, the consequences of consanguinity include changes in allele frequencies and the appearance of regions of homozygosity, with the potential to impact measurements of genetic variation in population data. One fundamental assumption of the CEPH project has been that pedigree structures are correct as annotated. Recently, evidence for inbreeding was established within a subset of the HGDP-CEPH panel.11

A variety of methods are available to determine relatedness between individuals based on genotype data. We adopted three complementary approaches. First, we used identity-by-state (IBS) methods in which SNP genotypes are compared between a pair of individuals at each chromosomal position, typically involving 900 000 comparisons per pair. We plotted IBS sharing and generated characteristic profiles for a variety of relationship types.12, 13 Furthermore, pairwise relationships between all members of a study population can be plotted to distinguish relatedness according to the autosome-wide amount of IBS2* (defined as AB/AB sharing) divided by the sum of IBS2* and IBS0. Such a ratio (IBS2*_ratio), suggested by Lee14 and related to an approach by Rosenberg,15 reduces to a value of 1 for parent–child and identical samples (for whom there are essentially no IBS0 calls), and to a value of 2/3 for unrelated individuals. We previously validated Lee's approach, showing that IBS2*_ratio values of 2/3 corresponded to unrelated individuals, as expected.16 We note that other factors such as heterozygosity rates can impact these measurements, with some pairs of individuals having IBS2*_ratios of 2/3 being related.
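
The IBS2*_ratio described above can be illustrated with a minimal Python sketch. It assumes biallelic SNPs encoded as two-character strings ("AA", "AB", "BB") with None for a missing call; the function names are hypothetical and are not taken from SNPduo or any other program named in this paper. The last function evaluates the ratio expected for two unrelated individuals at a single SNP under Hardy–Weinberg proportions, which gives 2/3 for any allele frequency.

```python
from collections import Counter


def ibs_state(g1, g2):
    """Classify one SNP as 'IBS0', 'IBS1', 'IBS2' or 'IBS2*' (None if a call is missing)."""
    if g1 is None or g2 is None:
        return None
    a, b = sorted(g1), sorted(g2)          # "BA" and "AB" are treated alike
    if a == b:
        # identical genotypes: AB/AB is IBS2*, AA/AA or BB/BB is plain IBS2
        return "IBS2*" if a[0] != a[1] else "IBS2"
    if a[0] == a[1] and b[0] == b[1]:
        return "IBS0"                      # opposite homozygotes (AA vs BB)
    return "IBS1"                          # one heterozygote, one homozygote


def ibs2_star_ratio(genos1, genos2):
    """IBS2* / (IBS0 + IBS2*) over all SNPs of one pair of individuals."""
    counts = Counter(ibs_state(x, y) for x, y in zip(genos1, genos2))
    denom = counts["IBS0"] + counts["IBS2*"]
    return counts["IBS2*"] / denom if denom else float("nan")


def expected_unrelated_ratio(p):
    """Expected ratio for unrelated individuals at a SNP with allele frequency p."""
    q = 1.0 - p
    p_ibs2_star = (2 * p * q) ** 2         # AB/AB
    p_ibs0 = 2 * (p * p) * (q * q)         # AA/BB or BB/AA
    return p_ibs2_star / (p_ibs0 + p_ibs2_star)


# Toy genotypes (illustrative only, not study data)
parent = ["AB", "AA", "BB", "AB"]
child = ["AB", "AB", "AB", "AA"]
print(ibs2_star_ratio(parent, child))      # no IBS0 calls, so the ratio is 1.0
print(expected_unrelated_ratio(0.3))       # approximately 2/3
```

Because the allele-frequency terms cancel, the 2/3 expectation does not depend on p; the observed ratio still shifts when heterozygosity differs between individuals, as noted above.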

A second approach for determining relatedness involves IBD estimation. IBD0, IBD1, and IBD2 states can be inferred from a subset of their corresponding IBS states. There has been a large amount of work in relationship estimation using IBD. Recent algorithms include FastIBD,17 a revised maximum likelihood estimator that includes Fst as a parameter,18 and ERSA.19 We define K0, K1 and K2 as estimates of Cotterman coefficients of relatedness (k0, k1, k2), which allow the inference of the degree of relatedness between individuals.20, 21 We recently validated our IBD approach (kcoeff)16 against the IBD estimates given by PLINK.22 In this study, we compare kcoeff's IBD estimates to those provided by PREST23 and annotations of relationships given by the likelihood-based methods of RELPAIR.24
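
As a point of reference for the Cotterman coefficients, the sketch below tabulates the theoretical (k0, k1, k2) values for common outbred relationships and looks up the nearest one for a given estimate. The nearest-point lookup is an illustrative assumption only; it is not how kcoeff, PREST, or RELPAIR assign relationships, and the names are hypothetical.

```python
# Theoretical Cotterman coefficients (k0, k1, k2) in the absence of inbreeding
THEORETICAL_K = {
    "identical": (0.0, 0.0, 1.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "second degree (AV, GG, half-siblings)": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def kinship(k0, k1, k2):
    """Kinship coefficient implied by the IBD sharing probabilities."""
    return k1 / 4 + k2 / 2


def nearest_relationship(k0, k1, k2):
    """Label of the theoretical point closest (squared distance) to the estimate."""
    def dist(label):
        t0, t1, t2 = THEORETICAL_K[label]
        return (k0 - t0) ** 2 + (k1 - t1) ** 2 + (k2 - t2) ** 2
    return min(THEORETICAL_K, key=dist)


print(nearest_relationship(0.26, 0.49, 0.25))   # full siblings
print(kinship(0.0, 1.0, 0.0))                   # 0.25 for parent-child
```

Second-degree relationships share a single theoretical point, which is why avuncular, grandparental, and half-sibling pairs cannot be separated by these coefficients alone.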

A third approach involves determining regions of homozygosity based on the absence of AB genotype calls in contiguous stretches. Our study extends work by Broman and Weber25 who measured short tandem-repeat polymorphisms in 134 individuals from eight CEPH families. They identified long stretches of homozygous markers, particularly in CEPH/Venezuelan pedigree 102 and CEPH/Amish pedigree 884. They interpreted these as due to the mating of closely related individuals (autozygosity) rather than linkage disequilibrium in the population. The present study includes analyses of three of the same extended families. Broman and Weber25 propose that long regions of homozygosity due to autozygosity are common in human genomes. Our combined analyses indicate the presence of relatedness both within and between pedigrees and show that the majority of individuals with homozygosity are from inbred populations. This suggests that there are relatively few protracted regions of homozygosity in outbred populations.

Materials and methods

CEPH genotype data

Genomic DNA from 186 individuals in CEPH pedigrees (families 35, 66, 102, 104, 884, 1331, 1356, 1400, 1416, 1424, 1427, 1477, 1582) was previously obtained by the Coriell Institute for Medical Research (CIMR) and used to generate data on the Affymetrix 6.0 genotyping platform (n=934 940 SNPs). We obtained these data and filtered them to include only autosomal SNPs (n=871 166). SNP data for 181 samples were deposited by CIMR in the NIH Database of Genotypes and Phenotypes under study accession phs000268.v1.p1. Nine of the samples (NA07340, NA12248, NA12249, NA10835, NA10845, NA11930, NA11931, NA11932, NA11933) overlap with HapMap III.26

IBS and IBD analyses

We analyzed IBS with SNPduo (a web-based program that generates plots and tables of IBS sharing across chromosomes12), SNPduo++ (a command-line program used to analyze all 17 205 pairwise comparisons between the 186 samples12, 14), and Partek Genomics Suite (version 6.5; St Louis, MO, USA) to obtain IBS2*_ratio values.14, 16
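
The number of pairwise comparisons follows from n(n-1)/2 for n samples. A short Python check, using placeholder sample identifiers rather than the actual Coriell IDs:

```python
from itertools import combinations
from math import comb


def n_pairs(n_samples):
    """Number of unordered pairs among n_samples individuals."""
    return comb(n_samples, 2)


# Placeholder identifiers (hypothetical, not the Coriell sample names)
sample_ids = [f"S{i:03d}" for i in range(186)]
pairs = list(combinations(sample_ids, 2))

print(len(pairs), n_pairs(186))   # both are 17205
```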

We analyzed IBD with kcoeff software to estimate the Cotterman coefficients of relatedness K0, K1, and K2.16 The kcoeff algorithm uses an IBS0_ratio (IBS0/(IBS0+IBS2*)), which is the complement of the IBS2*_ratio, when calculating K0, K1, and K2. Concordant homozygous SNPs were removed for each pairwise comparison, resulting in an average of 419 297 informative SNPs.
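
The filtering step and the IBS0_ratio can be sketched as follows. This is not the kcoeff implementation: it assumes genotypes are normalized to "AA", "AB", or "BB" (None for a missing call), and dropping missing calls alongside concordant homozygotes is an assumption made here for the illustration.

```python
def is_concordant_homozygous(g1, g2):
    """True for AA/AA or BB/BB, which carry no information in this scheme."""
    return g1 == g2 and g1[0] == g1[1]


def informative_pairs(genos1, genos2):
    """Genotype pairs left after removing missing and concordant homozygous SNPs."""
    return [
        (a, b)
        for a, b in zip(genos1, genos2)
        if a is not None and b is not None and not is_concordant_homozygous(a, b)
    ]


def ibs0_ratio(genos1, genos2):
    """IBS0 / (IBS0 + IBS2*), which equals 1 minus the IBS2*_ratio."""
    ibs0 = ibs2_star = 0
    for a, b in informative_pairs(genos1, genos2):
        het_a, het_b = a[0] != a[1], b[0] != b[1]
        if het_a and het_b:
            ibs2_star += 1                 # AB/AB
        elif not het_a and not het_b:
            ibs0 += 1                      # AA/BB (concordant pairs already removed)
    denom = ibs0 + ibs2_star
    return ibs0 / denom if denom else float("nan")


# Toy genotypes (illustrative only)
g1 = ["AA", "AA", "AB", "BB", "AB"]
g2 = ["AA", "BB", "AB", "BB", "AA"]
print(len(informative_pairs(g1, g2)))      # 3 informative SNPs
print(ibs0_ratio(g1, g2))                  # 0.5
```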

Homozygosity analyses

We developed an algorithm (called hetSNP) in Perl that employed an SNP by SNP sliding window approach to identify regions of homozygosity for every individual in a population. For each window, the percentages of homozygous, heterozygous, and No Call (NC) alleles were calculated. Minimal homozygous regions were defined as windows of 200 SNPs containing ≤1% heterozygous alleles and ≤5% NCs. Overlapping homozygous regions were combined into a single region. Homozygous regions ≥3 Mb and ≥800 SNPs were reported. This region size was selected to define informative regions, facilitating SNPduo analysis.
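
hetSNP was written in Perl and its source is not reproduced here. The Python sketch below is one reading of the description above, for a single chromosome of one individual: slide a 200-SNP window one SNP at a time, keep windows with ≤1% heterozygous and ≤5% No Call genotypes, merge overlapping windows, and report merged regions of ≥3 Mb and ≥800 SNPs. The function name, the 'hom'/'het'/'nc' encoding, and the return format are assumptions.

```python
def homozygous_regions(positions, calls, window=200, max_het=0.01,
                       max_nc=0.05, min_bp=3_000_000, min_snps=800):
    """Return (start_bp, end_bp, n_snps) for each reported homozygous region.

    positions: sorted base-pair positions of the SNPs on one chromosome
    calls:     'hom', 'het' or 'nc' for each SNP, in the same order
    """
    n = len(calls)
    if n < window:
        return []

    # Prefix sums give the het / No Call count of any window in O(1)
    het, nc = [0], [0]
    for c in calls:
        het.append(het[-1] + (c == "het"))
        nc.append(nc[-1] + (c == "nc"))

    merged = []                            # [first_index, last_index] per region
    for start in range(n - window + 1):
        end = start + window               # exclusive end index of this window
        n_het = het[end] - het[start]
        n_nc = nc[end] - nc[start]
        if n_het <= max_het * window and n_nc <= max_nc * window:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = end - 1    # overlaps the previous region: extend it
            else:
                merged.append([start, end - 1])

    regions = []
    for first, last in merged:
        span = positions[last] - positions[first]
        count = last - first + 1
        if span >= min_bp and count >= min_snps:
            regions.append((positions[first], positions[last], count))
    return regions


# Synthetic chromosome (illustrative only): 1000 homozygous calls flanked by
# heterozygous calls, with SNPs spaced 4 kb apart.
calls = ["het"] * 500 + ["hom"] * 1000 + ["het"] * 500
positions = [i * 4000 for i in range(len(calls))]
print(homozygous_regions(positions, calls))
```

Because a window may contain up to 1% heterozygous calls, a reported region can extend a few SNPs beyond the strictly homozygous stretch at either end.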

Homozygosity and distant IBD

Our IBD method is robust for inferring relationships with an estimated K1≥0.03. Pairwise comparisons below this K1 estimate may correspond to true distantly related individuals or truly unrelated individuals. To support potential relatedness we used SNPduo to identify chromosomal regions (blocks) lacking IBS0 (implying IBD1 sharing). Truly unrelated individuals are expected to have K1 estimates of zero that may not always be exactly zero due to a window approach in which some regions of little variability have fewer IBS0 calls. To determine whether relationships were present that involved stretches of homozygosity, we applied the following criteria, of which the first four were necessary:

  1) A region lacking heterozygous (AB) calls in a child across a segment ≥3 Mb and ≥800 SNPs.
  2) A corresponding parental region lacking IBS0 calls, likely representing relatedness between the parents. This region must be equal to or larger than the segment lacking heterozygous calls in the child.
  3) IBD2 sharing between a parent and a child supporting abnormal relatedness between the parents.
  4) SNP intensity data indicating a euploid copy number.
  5) For large sibships, such as those in CEPH families, multiple siblings (on average one quarter) are expected to have a lack of AB calls in the regions of inbreeding.
  6) For individuals who are candidates for inbreeding, the occurrence of autozygosity on multiple chromosomal loci provides additional support.
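
The four required criteria can be expressed as a simple check. This is a minimal sketch under stated assumptions: the Segment class, the function names, and the example coordinates are hypothetical, and "equal to or larger than" is interpreted here as the parental region fully containing the child's segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chrom: str
    start: int      # base-pair start
    end: int        # base-pair end
    n_snps: int = 0


def contains(outer, inner):
    """True if 'outer' covers 'inner' on the same chromosome."""
    return (outer.chrom == inner.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


def meets_required_criteria(child_roh, parental_no_ibs0, parent_child_ibd2,
                            copy_number, min_bp=3_000_000, min_snps=800):
    """Evaluate criteria 1-4; criteria 5 and 6 are supporting evidence only."""
    c1 = (child_roh.end - child_roh.start) >= min_bp and child_roh.n_snps >= min_snps
    c2 = contains(parental_no_ibs0, child_roh)
    c3 = bool(parent_child_ibd2)
    c4 = copy_number == 2
    return c1 and c2 and c3 and c4


# Hypothetical coordinates for illustration (not taken from the study data)
child = Segment("1", 10_000_000, 15_260_000, n_snps=1600)
parents = Segment("1", 9_500_000, 16_000_000)
print(meets_required_criteria(child, parents, parent_child_ibd2=True, copy_number=2))
```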

RELPAIR and PREST analyses

We analyzed relatedness using RELPAIR as described.24 We excluded chromosomes X, Y and M (mitochondrial SNPs) before implementing the PLINK ‘--thin’ command (n=25 times) to randomly select SNPs. The final output consisted of 1412 (out of 17 205) comparisons that were assigned at least one of the following relationships in at least 1 out of 25 runs: monozygotic twins, parent–offspring, full-siblings (FS), avuncular (AV), grandparent–grandchild (GG), half-siblings, or cousins (CO). Relationships not specified as described above were assigned unrelated (UN) status.
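
Summarizing repeated runs for one pair can be sketched as below. The input is assumed to be a list of relationship labels, one per run, already parsed from the program output; the parsing itself and the function name are hypothetical, and the ≥13/25 majority threshold is the one used in the Results.

```python
from collections import Counter


def summarize_runs(calls, majority=13):
    """Summarize the relationship labels assigned to one pair across runs.

    calls: non-empty list of labels such as 'FS', 'CO', 'UN' (one per run)
    """
    counts = Counter(calls)
    top_call, top_n = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        # pair is retained if any run assigned a non-UN relationship
        "ever_related": any(rel != "UN" for rel in counts),
        # consensus label only if it was assigned in at least 'majority' runs
        "majority_call": top_call if top_n >= majority else None,
    }


# Illustrative labels only (not study output)
print(summarize_runs(["FS"] * 13 + ["UN"] * 12))
print(summarize_runs(["CO"] * 1 + ["UN"] * 24))
```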

We also analyzed relatedness with PREST using the ‘--aped’ and ‘--wped’ options.23 The following quality control measures were employed using PLINK:22 (1) individuals with ≤98% genotype call rate were removed; (2) SNPs with ≤90% genotype call rate were removed; (3) SNPs failing Hardy–Weinberg equilibrium (HWE) with a P≤0.0001 were removed; (4) SNPs with a minor-allele frequency ≤0.01 were removed. Zero individuals were removed, whereas 3130 SNPs with low call rate, 114 850 SNPs with low minor-allele frequency, and 820 SNPs that failed HWE were removed. A total of 753 418 remaining SNPs were pruned within PLINK using the ‘--thin’ command, providing 45 451 SNPs for the PREST analysis. We note that as some samples were duplicated, it was impossible to accurately specify annotated relationships for PREST input files.
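
The quality control thresholds can be restated as predicates. The study applied these filters within PLINK; the Python sketch below only restates the thresholds and does not reproduce PLINK's implementation. The function names and the per-SNP summary inputs (call rate, minor-allele frequency, HWE P-value) are assumptions.

```python
def passes_sample_qc(call_rate, min_call_rate=0.98):
    """Individuals with call rate <= 98% are removed."""
    return call_rate > min_call_rate


def passes_snp_qc(call_rate, maf, hwe_p,
                  min_call_rate=0.90, min_maf=0.01, min_hwe_p=0.0001):
    """SNPs are removed if call rate <= 90%, MAF <= 0.01, or HWE P <= 0.0001."""
    return call_rate > min_call_rate and maf > min_maf and hwe_p > min_hwe_p


# Illustrative values only
print(passes_snp_qc(call_rate=0.995, maf=0.23, hwe_p=0.41))    # True
print(passes_snp_qc(call_rate=0.995, maf=0.01, hwe_p=0.41))    # False: MAF at threshold
```

A SNP can fail more than one filter, so the per-filter removal counts need not sum to the total number of SNPs removed.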

Results

Unexpected sharing in CEPH pedigrees

We obtained high-density SNP genotype data from a set of 186 CEPH individuals comprising 13 separate families. To determine the genetic relatedness of these individuals, we measured both IBS and IBD for every pairwise comparison (n=17 205 pairs) using autosomal data (see Materials and Methods). Each data point of an IBS2*_ratio plot consisted of a single pair of individuals (Figure 1a). The x-axis (IBS2*_ratio) included values of (IBS2*/(IBS0 +IBS2*)) where IBS2* denotes AB/AB genotypes. The y-axis included measurements of K1 (Figure 1a), K2 (Figure 1b), and K0 (Figure 1c) using the kcoeff method16 that estimates Cotterman coefficients of relatedness.

Figure 1

Relationships among CEPH three-generation pedigree members based on IBS and IBD measurements. (a) IBS2*_ratio plot annotated by relationships. Each data point corresponds to a comparison of two individuals based on genotype data. The IBS2*_ratio consisted of autosome-wide (IBS2*/(IBS0+IBS2*)) on the x-axis measured against kcoeff's K1 (level of genome shared IBD1) on the y-axis. Clusters were expected (based on prior sample annotation) of identical, parent–child, full-siblings, 1/4 sharing (ie, AV and GG), and unrelated individuals. We also observed pairs of samples having x-axis values consistent with distant relatedness (eg, arrows 1 and 2). (b) IBS2*_ratio (x-axis) versus kcoeff's K2 (level of genome-shared IBD2; y-axis), annotated by relationships. (c) IBS2*_ratio (x-axis) versus kcoeff's K0 (level of genome shared IBD0; y-axis).

On the basis of available pedigree information, we expected to observe three identical sample pairs, 317 parent–child, 522 full-sibling, 506 one-quarter sharing divided into 386 grandparent–grandchild and 120 AV pairs (includes all AV and materteral relationships; inferred by placement of known identical samples within their respective pedigree), and unrelated individuals. Identical and parent–child relationships had expected IBS2*_ratio values of 1.0 owing to few IBS0 calls (Figures 1a–c), but were separated along the y-axis because identical samples are solely IBD2 (Figure 1b; see arrow). Unexpectedly, we observed a fourth pairwise comparison that segregated to a position indicating identical samples (Figure 1a and b; note identical samples overlap). Three pairs had been previously annotated as part of CEPH/Venezuelan pedigree 102 and 104, whereas the fourth pair was annotated as a grandmother–granddaughter relationship (NA12863 and NA12859 in CEPH/Utah pedigree 1400). We concluded that both of these CEPH/Utah samples were derived from the granddaughter (based on sibling relatedness on the IBS2* plot and K1 and K2 estimates; see below). It was subsequently confirmed that the granddaughter's DNA sample had been genotyped twice (Dr Norman Gerry, Coriell Cell Repositories, personal communication).

We confirmed 317 parent–child relationships based on IBS and IBD estimates. As expected for relationships with no IBS0, IBS2*_ratio values were near 1.0 (Figure 1a). Notably, some parent–child relationships also had appreciable levels of IBD2 (K2, Figure 1b) and IBD0 (K0, Figure 1c) that will be discussed in detail below.

Siblings who have theoretical Cotterman coefficients of 1/4 IBD0, 1/2 IBD1, and 1/4 IBD2, had IBS2*_ratio values near 0.90 and were distinct from other relationships (Figures 1a and b) because of the presence of IBD2 sharing. A total of 522 full-sibling pairs were confirmed based on IBS estimates as well as IBD1 and IBD2 estimates.

Pairwise relationships involving one-quarter sharing included AV and GG comparisons based on pedigree annotations. These pairs had IBS2*_ratio values that ranged from 0.77 to 0.86 (Figure 1a). Furthermore, IBD analysis for these 506 pairs matched expected Cotterman coefficient values with estimates centered on 1/2 IBD0 and 1/2 IBD1 (K1, Figure 1a; K0, Figure 1c). In addition to the unexpected IBD2 sharing estimated in parent–child relationships, some GG and AV relationships were also inferred to have IBD2 sharing (Figure 1b). These will be explained alongside parent–child relationships with IBD2 in detail below. We generated a complete list of IBD estimates for annotated pairwise comparisons (Supplementary Table 1).

Pairs of individuals who were annotated as unrelated are expected to have IBS2*_ratio values of 2/3.14, 16 Values >2/3 can be attributed to genetic relatedness or to elevated heterozygosity in one (or both) individuals.16 We therefore rely on the kcoeff method to identify distantly related individuals. Unexpectedly, we observed a cluster of pairwise comparisons with K1 values ≥0.03 indicating distant relatedness (Figure 1a; see arrow). In some instances K1 values ranged from 0.166 to 0.244, consistent with theoretical Cotterman coefficients of 0.25 (eg, first-cousins) or 0.125 (eg, first-cousin once removed). These included grandfather/grandmother couples (NA12977 and NA12978) from pedigree 1427 (Figure 1a, arrow 1; K1 value of 0.244) and (NA13180 (duplicate sample NA13055) and NA13181 (duplicate sample NA13057)) from pedigree 102/104 (arrow 2; four comparisons; K1 value of 0.166). Another notable pair, paternal grandmother NA11931 and maternal grandmother NA11933, from pedigree 1424 had a K1 of 0.13. It is important to note that this pair is present in HapMap 3 and represents an unannotated related pair. Each pair had regions that lacked IBS0 based on SNPduo analyses, supporting the finding of distant genetic relationships (data not shown). A complete list of individuals inferred to be related is presented in Supplementary Table 2.

Increasing relatedness between pairs of individuals was associated with decreasing K0 estimates (Figure 1c). Given that the IBS2*_ratio includes IBS0 information in the denominator, it is expected that a decrease in IBS0 results in a higher IBS2*_ratio. Furthermore, the estimated level of K0 should also decrease as the level of IBS0 is reduced. Note that some parent–child relationships have estimated IBD0 values that will be discussed below.

IBS confirmation of IBD relatedness findings

To confirm IBD1 (displayed in Figure 1a) or IBD2 sharing (Figure 1b) based on the genotype data, we analyzed relationships on a chromosome-by-chromosome basis in SNPduo to determine and visualize the extent of IBS sharing. We applied this to Amish individuals NA13113 and NA13114 from pedigree 884, in which unexpected sharing was detected (K1=0.092; Figure 2a). The IBS sharing between these two parents included many extended regions with a lack of IBS0 calls (eg, regions 1 and 8). Furthermore, each of the four grandparents (ie, the parents of NA13113/NA13114) in the pedigree shared K1 values ranging from 0.051 to 0.092 with respect to the other three (NA13111, NA13112, NA13115, and NA13116). As a consequence, the genomes of NA13113 and NA13114 had extensive tracts of homozygosity (Figures 2b and c, regions 6 and 7) as previously reported.25 These regions of homozygosity (due to autozygosity) were not shared, thus resulting in observed states of either IBS0 or IBS2 (Figure 2a, region 5).

Figure 2

Inbreeding in a CEPH/Amish pedigree. For all panels, data for chromosome 6 are shown based on SNPduo analyses.12 Upper three panels: (a) pairwise IBS patterns are presented for parents NA13113 and NA13114. Note regions 1 and 8 in which an absence of IBS0 is shown. Region 5 corresponds to regions 6 and 7 (panels b and c) and represents two individuals with different stretches of homozygosity (lack of AB calls) compared with each other; whereas both were homozygous, the differences were evident from the occurrence of IBS0 in region 5. (b) Genotypes of NA13113. Note homozygosity in region 6. (c) Genotypes of NA13114. Note homozygosity in region 7. (d) Pairwise IBS patterns for NA13117 and NA13127 (children of NA13113 and NA13114). Note that region 2 indicates IBD2 between the siblings NA13117 and NA13127 and overlaps a region of inferred IBD1 between the parents (region 1). (e) Genotypes for NA13117. Note that regions 3 and 9 are homozygous segments that correspond to a lack of IBS0 in regions 1 and 8 from panel a; this suggests consanguinity. (f) Genotypes of NA13127. Region 4 is a homozygous segment that corresponds to region 1 and is identical to region 3. This supports consanguinity due to lack of IBS0 between the parents (a).

Given the relatedness between the parents, we expected to observe homozygous segments in the offspring in regions where the parents were related. We plotted the results of SNPduo analysis for a representative pair of full-siblings (NA13117 and NA13127) for chromosome 6 (Figure 2d) in which there was IBD2 sharing (absence of IBS0 and IBS1, region 2) in a region corresponding to relatedness between the parents (region 1). We observed homozygosity in this same region in the children (Figures 2e and f, regions 3 and 4). We highlight a second example of homozygosity in child NA13117 (Figure 2e, region 9) caused by relatedness between the parents (Figure 2a, region 8).

Identification of distantly related individuals based on homozygosity in pedigrees

In addition to using IBS and IBD to define distantly related individuals, we further identified individuals who were inbred using analyses of homozygosity in the context of pedigrees. We applied our analysis to individuals for whom parental genotype data were available and also applied other criteria (see Materials and Methods). A notable discovery involved pedigree 1582. A region lacking IBS0 between the paternal grandfather (NA12921) and maternal grandmother (NA12924) was observed that overlaid a region of homozygosity in the grandchildren (NA12915, NA12917, NA12918, and NA12919). This occurred on chromosome 1 and spanned 5.26 Mb and involved 1600 SNPs. We summarize our homozygosity findings for all pedigrees in Figure 3 and our IBS/IBD estimates for related founders in Table 1 as well as a complete list of the 86 individuals inferred to be related (Supplementary Table 2). We confirmed a copy number state of 2 in the regions of homozygosity (data not shown).

Figure 3

Revised CEPH pedigrees. Curved lines denote significant relatedness between pairs of individuals with estimated autosome-wide K1 and pIBD1 values indicated in Table 1. Dashed rectangles correspond to pedigrees as given on the Coriell Cell Repositories website,34 except that CEPH/Venezuelan pedigrees 102 and 104 had not previously been explicitly presented as a single pedigree visually. Numbers given on the pedigrees for individuals correspond to standard Coriell designations (eg, 11930 corresponds to cell line GM11930 or DNA sample NA11930). (a) CEPH/Utah pedigrees 1331 and 1424 were interrelated based on K1 sharing between grandmother NA07050 and grandfather NA11932 on chromosome 4. Additionally, a significant K1 value was estimated between paternal grandmother NA11931 and maternal grandmother NA11933 in CEPH/Utah pedigree 1424. (b) CEPH/Utah pedigree 1427 had significantly elevated K1 between NA12977 and NA12978 and marginal K1 between all four grandparents supported by homozygosity and SNPduo analysis (see Materials and Methods). (c) CEPH/Venezuelan families 102 and 104 include numerous AV relationships. All four grandparents displayed elevated K1 levels. (d) CEPH/Utah pedigree 1582 had minimal K1 levels associated with homozygosity between NA12921 and NA12924 (supported by SNPduo analysis). (e) CEPH/Amish pedigree 884 was characterized by four grandparents with elevated K1 levels. (f) CEPH/Utah pedigree 1356 had minimal K1 levels also associated with homozygosity and supported by SNPduo analysis. *Indicates the presence of homozygosity due to autozygosity; #indicates the presence of multiple regions; any number within a square or circle indicates a unique chromosome in which relatedness or homozygosity was present; @indicates the members of CEPH/Venezuelan pedigree 104 within the combined CEPH/Venezuelan pedigrees 102 and 104.

Table 1 k1 estimates for relationships within and between families reported in Figure 3

We compared the amounts of homozygosity we identified in all individuals to those reported by Broman and Weber25 (Table 2). We also report amounts of homozygosity in individuals not studied by Broman and Weber (listed in Table 3). Notable individuals included NA11035 from pedigree 104 who had 86 Mb of homozygous regions and NA12969 (daughter of NA12977/NA12978; Figure 1a, arrow 1) from pedigree 1427 who had 268 Mb total from 16 homozygous regions. A brief comparison of regions inferred to be homozygous by the kcoeff method and from Broman and Weber is presented in Supplementary Figure 1 in which we report comparable results, but better define the boundaries due to a greater number of markers. A complete list of individuals with the chromosome and position of each region is presented in Supplementary Table 3.

Table 2 Comparison of estimates of homozygosity in this study to Broman and Weber25
Table 3 Individuals with new estimates of homozygosity

Comparisons to RELPAIR and PREST

We compared our analysis method to that of RELPAIR,24 a leading software package that has recently been used to annotate relationships in HapMap Phase III.27 We used RELPAIR to analyze all pairwise relationships using 25 independent runs for each comparison (see Materials and Methods). We note that although RELPAIR identified all identical and full-sibling relationships, it also called several annotated parent–child and second-degree relationships as full-siblings (in particular, those that had unexpected IBD2 estimates). These apparently misclassified individuals were within those pedigrees in which inbreeding has already been shown (see above). In addition, some second-degree relationships (eg, AV) were miscalled as being a different second-degree relationship (eg, half-sibling). We summarized the relationships as annotated by RELPAIR and based on prior annotation in a confusion matrix (Table 4), and we listed the RELPAIR annotation for each pairwise comparison in Supplementary Table 1.

Table 4 Confusion matrix for relationships inferred by RELPAIR compared with authentic relationships based on annotated CEPH three-generation pedigrees

To determine whether the amount of IBD2 estimated in these relationships impacted RELPAIR's ability to differentiate full-sibling from parent–child and second-degree relationships, we plotted IBS2*_ratio values versus K2 (Figure 4a) and the corresponding IBD2 estimate of PREST, pIBD2 (Figure 4b). We annotated these plots by the number of times RELPAIR called full-sibling per 25 trials. RELPAIR called 50 (out of 317 parent–child relationships) of these pairwise comparisons a full-sibling relationship in ≥13 out of 25 instances with an average K2 of 0.043±0.014 and a pIBD2 of 0.027±0.017. Overall, 18 parent–child comparisons were called full-sibling less than half of the time with average values 0.021±0.006 for K2 and 0.013±0.015 for pIBD2. The increase in estimated IBD2 was correlated with the increase in RELPAIR assigning full-sibling status to annotated parent–child relationships with a linear Pearson correlation of r=0.892 for K2 and r=0.719 for pIBD2. Similar results were observed for elevated IBD2 in AV or grandparental relationships that were incorrectly assigned full-sibling status by RELPAIR. A higher correlation of r=0.839 was associated with full-sibling designation and K2 as opposed to r=0.632 for pIBD2. As IBD2 estimates increased, RELPAIR had a higher likelihood of misclassifying relationships as full-sibling. Furthermore, estimates for level of IBD2 using the kcoeff method were more consistent than those of PREST.
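
The Pearson correlation between the number of full-sibling calls per 25 trials and an IBD2 estimate can be computed as in the sketch below. The data values are invented for illustration and do not reproduce the correlations reported above; the function name is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Linear Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented example: FS calls per 25 trials versus a K2 estimate for six pairs
fs_calls = [0, 3, 9, 14, 21, 25]
k2_values = [0.002, 0.015, 0.024, 0.031, 0.047, 0.052]
print(round(pearson_r(fs_calls, k2_values), 3))
```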

Figure 4

Comparison of kcoeff to PREST and RELPAIR for k1 and k2, and comparison of kcoeff and PREST for k0. An IBS2*_ratio is plotted as a function of varying IBD estimates annotated by RELPAIR for k1 and k2 estimates with a comparison of k0 between kcoeff and PREST in parent–child relationships. (a) K2 (kcoeff) estimates of true AV, GG, and parent–child relationships incorrectly assigned a FS annotation from RELPAIR. Note the positive relationship between level of K2 and number per 25 trials of RELPAIR FS annotation. The color scale for panels a and b includes black (no RELPAIR FS calls) and ranges from blue (1 FS call per 25 trials) to red (25/25 FS calls). (b) pIBD2 (PREST) estimates of relationships incorrectly assigned a FS annotation from RELPAIR. (c) K1 estimates of unrelated and distantly related individuals assigned a CO annotation from RELPAIR. Note the positive relationship between level of K1 and number per 25 trials of CO annotation. (d) pIBD1 estimates of unrelated and distantly related individuals assigned a CO annotation from RELPAIR. (e) K0 estimates in parent–child relationships with selected pairs being identified. (f) pIBD0 estimates in parent–child relationships with selected pairs highlighted. Note the discrepancies between panels e and f. Also note that as there were duplicate samples present, it was not possible to fully annotate all relationships (see arrow 1 in panel f). The x and y-axis scales are the same for panels a/b, c/d, and e/f. Arrows indicate specific pairwise comparisons (see text for details).

RELPAIR also annotates first-cousin relationships.24 We analyzed all pairwise relationships that were annotated as unrelated, plotted IBS2*_ratio values versus estimates of K1 and pIBD1, and annotated by assignment of first cousins by RELPAIR (Figures 4c and d) per 25 trials. Note that some 1/4th sharing relationships with the lowest K1 and pIBD1 estimates were occasionally called first cousins and comprised a small minority of relationships designated as first cousins (data not shown). A total of 29 comparisons were designated as first cousins in a majority of RELPAIR runs (≥13/25) with an average K1 of 0.092±0.004 (Figure 4c) and pIBD1 of 0.063±0.060 (Figure 4d). Additionally, 34 comparisons were called first cousins at least once for values of 0.034±0.016 for K1 and 0.007±0.011 for pIBD1. The level of K1 in individuals annotated as unrelated had a correlation value of r=0.853, whereas pIBD1 estimates were r=0.761. This further supports the kcoeff estimation of IBD level as being more accurate than the maximum likelihood approach of PREST.

We also compared our IBD0 estimates in parent–child relationships to those of PREST. Using the kcoeff method, we detected IBD0 in four parent–child relationships (Figure 4e; NA11039/NA13056 (duplicated 13184), see red circle; NA12456/NA13133, see yellow circle; NA11036/NA11039, see green circle; and NA12615/NA12621, see orange circle). NA11039/NA13056 (duplicated 13184) and NA11036/NA11039 represent a trio, and we analyzed them using SNPduo to visualize the IBS0 (Supplementary Figures 2a and b). We observed extensive IBS0 in two regions between mother/daughter NA13056/NA11039 (Supplementary Figure 2a, regions 1 and 2 spanning 48 Mb on chromosome 14) and in one region between father/daughter pair NA11036/NA11039 (Supplementary Figure 2b, region 3 spanning 3 Mb on chromosome 14). Consistent with the K0 estimates, PREST's pIBD0 values indicated IBD0 in mother/daughter NA13184/NA11039. Surprisingly, pIBD0 was not detected in mother/daughter pair NA13056/NA11039 involving an identical comparison (Figure 4f; see arrows). A similar splitting of identical comparisons was found between father/son NA13180/NA13194 and father/son NA13055/NA13194 (Figure 4f) for whom we estimated similar, elevated K0 values for both comparisons. pIBD0 estimates by PREST were elevated for a group of pairwise comparisons (Figure 4f; arrow 1), which could not be fully annotated because some of the parental samples were duplicated. When these samples were manually annotated as parent–child, the pIBD0 estimates reverted to zero (data not shown) suggesting they were false positive results.

Discussion

The main finding of the present study is that CEPH pedigrees include previously unreported relationships both within and between pedigrees. As expected, we uncovered many regions of homozygosity due to autozygosity in the offspring of related individuals across a subset of 13 pedigrees that included 186 individuals. A subset of our findings was broadly consistent with a 1999 report by Broman and Weber25 that many CEPH individuals have extended regions of homozygosity. We describe homozygosity due to autozygosity in almost two dozen additional individuals, based on IBD (kcoeff), IBS (SNPduo) and homozygosity analyses. A recent report suggested that 10.4% out of the 6.7 billion people in the world have an inbreeding coefficient greater or equal to second cousins (F ≥0.0156).10 An estimation of inbreeding coefficients in our samples revealed that 21% had an F ≥0.0156. An additional 9% had a coefficient of inbreeding greater than third cousins (F ≥0.0039).
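
The F thresholds quoted above correspond to the offspring of cousin matings. Under the standard pedigree result that the inbreeding coefficient of a child equals the kinship coefficient of its parents, the offspring of n-th cousins have F = (1/4)^(n+1) when the common ancestors are themselves non-inbred. A short Python check (the function name is hypothetical):

```python
def offspring_f_of_nth_cousins(n):
    """Inbreeding coefficient of a child whose parents are n-th cousins.

    Assumes full cousins with two shared, non-inbred common ancestors.
    """
    return 0.25 ** (n + 1)


for degree, label in [(1, "first"), (2, "second"), (3, "third")]:
    print(label, "cousins:", offspring_f_of_nth_cousins(degree))
# first cousins: 0.0625, second cousins: 0.015625, third cousins: 0.00390625
```

The values for second and third cousins, 1/64 and 1/256, match the thresholds of 0.0156 and 0.0039 when written to four decimal places.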

Our analysis of the amount of homozygosity observed in both inbred and outbred pedigrees suggests that very few individuals outside of inbred pedigrees have long homozygous regions. In fact, 111 out of 186 individuals had no regions of homozygosity in segments >3 Mb and having more than 800 SNPs. Many of the individuals in whom we report segments of homozygosity belonged to pedigrees that had inbreeding, which supports the finding from Broman and Weber.25 This assessment contrasts somewhat with other studies suggesting that long homozygous segments are common in the human genome.28, 29, 30 However, our results indicate that small homozygous segments around 1 Mb in length are quite common in both inbred and outbred individuals with genome-wide totals in excess of 40 Mb per average individual (data not shown). This result agrees with previous estimates of total homozygosity based on the length of each segment.31, 32, 33 Many of these and smaller regions may be present due to sharing of common haplotypes, which is not accounted for by the kcoeff method and is below the resolution that we report. This could result in higher total homozygosity levels that match findings from previous studies.

We benchmarked our results against two leading software programs, RELPAIR and PREST. We found that RELPAIR incorrectly called many first- and second-degree relationships as full siblings, and we discovered a correlation between these miscalls and unexpected IBD2 sharing. Full-sibling relationships can be inferred either from the proportion of IBD2 sharing alone or from the expected Cotterman coefficients of relatedness (ie, K0 = 1/4, K1 = 1/2 and K2 = 1/4). For parent–child pairs with a small amount of IBD2 that were called full-sibling by RELPAIR, we did not observe a relative increase in IBS0. We thus conclude that RELPAIR misclassified relationships that were atypical (ie, parent–child with IBD2 or second-degree with IBD2). PREST generated IBD estimates that were generally comparable to those of kcoeff for these relationships.
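To illustrate why a parent–child pair with a small amount of IBD2 still resembles a parent–child pair more than a full-sibling pair, the sketch below compares observed (K0, K1, K2) values with the textbook expectations for common relationships. The nearest-neighbour rule and the function name are assumptions made for illustration; this is not the classification procedure used by RELPAIR, PREST or kcoeff.

```python
import math

# Textbook expected Cotterman coefficients (K0, K1, K2) in the absence of
# inbreeding. "second-degree" covers half-sibling, grandparent-grandchild
# and avuncular pairs, which share the same expectations.
EXPECTED_K = {
    "identical": (0.0, 0.0, 1.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full-sibling": (0.25, 0.5, 0.25),
    "second-degree": (0.5, 0.5, 0.0),
    "first-cousin": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def nearest_relationship(k0: float, k1: float, k2: float) -> str:
    """Return the relationship whose expected coefficients are closest (Euclidean)."""
    observed = (k0, k1, k2)
    return min(EXPECTED_K, key=lambda name: math.dist(observed, EXPECTED_K[name]))


# Hypothetical parent-child pair with 10% unexpected IBD2 and no IBD0.
atypical_call = nearest_relationship(0.0, 0.9, 0.1)
```

Because K0 remains at zero for such a pair, the hypothetical example stays far closer to the parent–child expectation than to the full-sibling expectation, which requires K0 near 1/4.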

RELPAIR, PREST and kcoeff were comparable in detecting distant relatedness between individuals annotated as unrelated, although RELPAIR was more consistent than PREST at detecting distantly related individuals. The number of RELPAIR cousin calls per 25 trials correlated more strongly with the kcoeff K1 estimate of IBD1 than with PREST's pIBD1, suggesting that RELPAIR and kcoeff have similar abilities. Although RELPAIR was not explicitly designed to be run 25 times in order to derive a consensus output,24 we did 25 runs in accordance with others’ usage of the program.27

As previously noted by the developers of RELPAIR, distinguishing between the three forms of second-degree relationships is difficult.24 In the present study, many false positive half-sibling relationships were called by RELPAIR, as well as false positive GG and AV amongst the second-degree relationships. Neither our method (kcoeff) nor PREST explicitly classifies second-degree relationships (although PREST can do so, given a genetic map specified by cM distance in the map file). Instead, they provide IBD estimates that can be used to infer relationships.

It is important to note a potential limitation of the kcoeff program: it produced decreased K1 and K2 estimates for individuals with homozygosity due to inbreeding. Because the kcoeff method relies on the presence of IBS2* calls (AB/AB) and assumes no inbreeding, homozygosity affects how the IBD states are assigned. This applies only to pairs of individuals in whom IBD1 or IBD2 sharing is present. For example, homozygosity present in one individual that is compared with a second individual with normal heterozygosity (eg, AA/AB calls) would result in zero IBS0 and zero IBS2* calls, leaving only IBS1 and concordant homozygous IBS2 calls (eg, AA/AA, which are not informative for kcoeff analyses). The IBD state of this region would be dependent on the flanking regions. This will have a minimal effect on kcoeff IBD estimates for relationships that have significant amounts of homozygosity, as indicated by our results.
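The example above can be made concrete by classifying each SNP comparison into an IBS state. This is a minimal sketch with invented genotypes and a hypothetical `ibs_state` helper; it assumes biallelic SNPs coded as two-letter strings and is not the kcoeff implementation.

```python
from collections import Counter


def ibs_state(g1: str, g2: str) -> str:
    """Classify one biallelic SNP comparison, eg ibs_state("AA", "AB")."""
    if len(g1) != 2 or len(g2) != 2:
        raise ValueError("genotypes must be two-character strings such as 'AB'")
    # Number of alleles the two genotypes have in common (0, 1 or 2).
    shared = sum((Counter(g1) & Counter(g2)).values())
    if shared == 0:
        return "IBS0"   # opposite homozygotes, eg AA/BB
    if shared == 1:
        return "IBS1"   # eg AA/AB
    # Two shared alleles: informative only if both genotypes are heterozygous.
    return "IBS2*" if g1[0] != g1[1] else "IBS2"


# Invented window: individual 1 is homozygous throughout and shares at
# least one allele with individual 2 at every SNP.
homozygous_individual = ["AA", "AA", "BB", "AA"]
heterozygous_individual = ["AB", "AA", "AB", "AB"]
counts = Counter(
    ibs_state(a, b)
    for a, b in zip(homozygous_individual, heterozygous_individual)
)
```

In this invented window the counts contain only IBS1 and concordant homozygous IBS2 calls, so neither of the two informative states (IBS0 and IBS2*) is observed.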

Additionally, it is worth noting a unique ability of the kcoeff software. Estimating relatedness has typically been restricted to within-group pairwise comparisons because of the impact that different allele frequencies can have on IBD estimation. The kcoeff software allows individuals from different groups to be compared because of the underlying ratio it uses to infer IBD. When two people share IBD1 or IBD2 within a given segment, their IBS0_ratio (IBS0/(IBS0+IBS2*)) for that window will be zero because no IBS0 calls occur. For individuals who are unrelated (and within the same mating population), the IBS0_ratio is centered on 1/3. However, for individuals who are from different groups, unrelated segments will have IBS0_ratios >1/3 because of the increase in IBS0 (ie, there are more differences between two members of different groups).16 IBD estimates from individuals belonging to two different groups will have reduced noise because there will be fewer regions of little variability between them to confound K1 estimates, as occurs when K1 estimates are slightly higher than zero.
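The expected values of 1/3 within a population and >1/3 between populations can be checked for a single SNP. The sketch below assumes Hardy–Weinberg proportions in each group and two unrelated, non-inbred individuals; the function name and allele frequencies are hypothetical, and the calculation is illustrative rather than a description of the kcoeff code.

```python
from fractions import Fraction


def expected_ibs0_ratio(p1: Fraction, p2: Fraction) -> Fraction:
    """Expected IBS0/(IBS0+IBS2*) at one SNP for two unrelated individuals.

    p1 and p2 are the frequencies of allele A in the populations of the
    first and second individual (equal when both come from the same group).
    """
    q1, q2 = 1 - p1, 1 - p2
    ibs0 = p1**2 * q2**2 + q1**2 * p2**2    # AA/BB or BB/AA
    ibs2_star = (2 * p1 * q1) * (2 * p2 * q2)  # AB/AB
    if ibs0 + ibs2_star == 0:
        raise ValueError("SNP is uninformative for these allele frequencies")
    return ibs0 / (ibs0 + ibs2_star)


same_group = expected_ibs0_ratio(Fraction(3, 10), Fraction(3, 10))
different_groups = expected_ibs0_ratio(Fraction(1, 5), Fraction(1, 2))
```

When the two frequencies are equal, the ratio reduces to 2p²q²/(2p²q² + 4p²q²) = 1/3 regardless of the allele frequency, which is why the within-group value does not depend on the population studied. When the frequencies differ, the IBS0 term grows relative to the IBS2* term and the ratio exceeds 1/3 (17/33 for the frequencies used here).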