Introduction

Deletions are one of the many types of mutation that can affect a genome. The observed size of deletions ranges from a single base pair up to an entire arm of a chromosome (see Lewis1). In recent years, there has been increasing interest in investigating structural variants, including deletions.2 One reason is that deletions may cause disease. One example is somatic deletions involved in cancer, which can be detected through ‘loss of heterozygosity’ in the tumour when compared with other tissues.3 Deletions may also arise de novo in the germline; these are notable when they subsequently cause disease in the offspring of the person in whom the deletion occurred.4 Finally, deletions causing disease may be inherited. At one end of the spectrum such deletions may be directly causal, but they may also act as risk alleles in complex diseases.5, 6

However, as with other types of mutation, deletions may have no phenotypic effect.7, 8 Alternatively, the effect may be so weak that it is effectively undetectable. Such mutations may then appear as polymorphisms within populations. The presence of deletion polymorphisms in the human genome has recently been investigated in a number of large-scale projects.5, 9, 10, 11, 12 These investigations demonstrate that a large number of deletion polymorphisms occur in humans. They also show that, although with few exceptions the deletion constitutes the rare allele, the frequency of the deletion may be relatively high. The presence and frequency of deletion polymorphisms are of interest for a number of reasons. In certain eukaryotic lineages, such as unicellular fungi, genome size is clearly under selective pressure.13 A small genome is assumed to have been an adaptive advantage because it can be replicated faster. Such evolution can only occur if deletions are present as the basic mutational events. Thus, the genomic positions and allele distribution of deletions are of evolutionary importance.

In order to study deletions, it is necessary to detect them accurately. Small deletions of only a few bases can be detected by sequencing. However, deletions that cover the entire amplified product to be sequenced will not be detected, since an individual who is heterozygous for the deletion will simply appear homozygous for the sequence. Such deletions could be detected using methods that measure the amount of DNA in specific chromosomal regions, such as PCR-based or hybridization-based methods (eg MLPA14 and CGH arrays15). Another approach is to apply genetic methods that infer the presence of deletions from the pattern of marker segregation in families. This approach has been employed to detect both de novo and inherited deletions. For example, marker segregation was used to detect de novo deletions in the RB1 locus in patients with sporadic, bilateral retinoblastoma.16 Inherited deletions were detected in autism kindreds using microsatellite markers17 and in protein S deficiency using both microsatellites and SNPs.18 The fact that analysis of segregation is an effective method to detect deletions has spurred theoretical investigations into the efficiency and power of these methods.19, 20 Amos et al21 have modelled the probabilities of different configurations of a trio, including de novo deletions, genotyping errors and inherited deletions. In an earlier paper,22 we described how deletions causing a dominant disease can be detected. Two recent papers present methods to infer copy number polymorphisms using quantitative measures of alleles and family information from nuclear families23 and trios.24 Kohler and Cutler25 present a method to detect deletions using SNP data on trios. Their approach is to estimate the frequency of a deletion using Mendelian inconsistencies, departures from Hardy–Weinberg proportions and unusual patterns of missing data.

Here, we investigate a method to detect deletions without phenotypic effect by examining the segregation pattern of genetic markers. This approach was recently used in two large-scale projects.5, 10 The method is based on the fact that deletions act as null alleles with respect to marker loci in the deleted region. Null alleles with causes other than deletions are also of interest, since undetected null alleles may cause errors in haplotype inference, parentage and population genetic analyses. The method is also relevant when searching for deletions involved in complex disease. Although a phenotypic effect is involved in these cases, the association between genotypes and phenotypes is usually so weak that the methods of the present paper are more relevant than the methods presented in Johansson et al.22 Thus, in the case of a complex disease, we suggest the following: first, search for deletions independently of phenotype and then, if a deletion is found, investigate the co-segregation between the disease and the deletion.

We have derived the probability to detect the existence of a deletion as a function of family structure and allele frequencies both of the marker alleles and the deletion. These probabilities were used to compare the efficiency of detecting deletions in different pedigree structures where grandparents or children were added to a trio.

Methods

Definitions and assumptions

In the following, we will consider a null allele at a genetic marker to be one that consistently fails to produce a detectable product or phenotype. We assume that dosage cannot be determined, so that a heterozygote for a null allele is not distinguishable from a homozygote for the other allele. Further, we assume that there are no additional genotyping errors. We also assume that all null alleles present in the families are inherited, ie they are not due to de novo mutations within the pedigree. Null alleles can then be detected as a special case of apparent non-Mendelian inheritance: a parent and child will appear to be homozygous for different alleles (Figure 1). This event is denoted by C (for ‘confirmed’). Note that the parent–offspring combination can be anywhere within the pedigree. In this paper, we do not consider an individual with no detectable phenotype (missing value) as evidence of a homozygote for a null allele; such a result could also be due to poor-quality DNA or technical mistakes during the genotyping process. In the calculations below we will consider one locus with the marker alleles A1, A2, …, Am and a null allele A0. The allele frequencies will be denoted by p0 for the deletion/null allele and by pi, i=1,…, m, for the marker alleles. For a biallelic marker, p1 and p2 denote the frequencies of the two marker alleles. Some analyses are extended to two biallelic loci (A and B), which are encompassed by the same deletion. The proportions of the five haplotypes will then be denoted by P11, P12, P21, P22 and P00, where Pij is the proportion of the AiBj haplotype and P00 is the allele frequency of the deletion.
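The detection rule defined above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the study: alleles are encoded as integers with 0 standing for the null allele A0 (an assumed encoding), and the function names `observed_genotype` and `confirms_null` are hypothetical.

```python
def observed_genotype(genotype):
    """Return the marker phenotype seen when allele 0 (the null allele) is invisible.

    genotype is a pair of integer allele labels; 0 denotes A0 (assumed encoding).
    """
    visible = sorted(a for a in genotype if a != 0)
    if not visible:
        return None                      # A0A0: no product, treated as a missing value
    if len(visible) == 1:
        return (visible[0], visible[0])  # AiA0 is read as an AiAi homozygote
    return tuple(visible)


def confirms_null(parent, child):
    """Event C: parent and child both appear homozygous, but for different alleles."""
    p, c = observed_genotype(parent), observed_genotype(child)
    if p is None or c is None:
        return False                     # missing values are not used as evidence
    return p[0] == p[1] and c[0] == c[1] and p[0] != c[0]


# Parent A1A0 transmits A0; the child receives A2 from the other parent.
print(confirms_null((1, 0), (0, 2)))   # True
# A true A1A1 parent with an A1A2 child shows ordinary Mendelian inheritance.
print(confirms_null((1, 1), (1, 2)))   # False
```

Returning `None` for the A0A0 genotype mirrors the decision in the text not to treat missing values as evidence of a null homozygote.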

Figure 1

Illustration of actual genotypes and marker phenotypes in a trio with a deletion. Quotation marks indicate a misinterpreted genotype.

Calculations and comparisons

As mentioned above, the criterion for confirming a deletion is to observe a parent and a child who are homozygous for different alleles. Such a pattern will, along with other kinds of deviation from Mendelian inheritance, be readily detected by a program such as PedCheck.26

Expressions for the probability of C, P(C), were derived for a number of different family structures as a function of the allele frequencies of the deletion and the marker alleles (Appendix 1). It is trivial that P(C) is higher in a larger family than in a smaller family, but larger families also require more individuals to be analysed. Thus, the relevant comparison is for a fixed total number of sampled individuals. We wanted to investigate which sampling design is most efficient, ie which would give the highest probability of detecting, in at least one family, a null allele that occurs at a certain frequency in the population. The probability of confirming a deletion in at least one family is calculated as $1-(1-P(C))^{N}$, where P(C) is the probability of confirming a deletion in a single family of the given type and N is the number of families of that type.
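The fixed-budget comparison described above amounts to two small calculations, sketched here in Python. The function names are hypothetical, and the per-family probability in the example is an arbitrary placeholder, not a value from the paper.

```python
def prob_at_least_one(p_c, n_families):
    """Probability that event C occurs in at least one of n independent families."""
    return 1.0 - (1.0 - p_c) ** n_families


def families_for_budget(total_individuals, family_size):
    """Number of complete families that a fixed number of genotyped individuals allows."""
    return total_individuals // family_size


# Hypothetical per-family probability, used only to show the call pattern.
p_c_example = 0.02
n_trios = families_for_budget(120, 3)
print(n_trios, prob_at_least_one(p_c_example, n_trios))
```

Because the number of families falls as the family size grows, a larger P(C) per family does not automatically translate into a higher overall detection probability; this is the trade-off that the comparisons below examine.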

Two series of comparisons were made. In the first comparison, the effect of adding parents and grandparents was investigated. In the second comparison, we investigated the effect of adding children to a trio, by using nuclear families with a varying number of children.

Comparison I: adding parents and grandparents

The probability to detect a deletion for a fixed overall number of investigated individuals was calculated for SNPs and multiallelic markers with varying allele frequencies. The family types are: one parent and one child (in this paper referred to as a duo), a trio, a trio with one grandparent (a tetra) and a trio with two grandparents (a pento). The family types can be seen in Figure 2. To investigate which family type is most efficient, we made comparisons assuming a total sample size of 120, corresponding to 60 duos, 40 trios, 30 tetras and 24 pentos. The results are illustrated for deletion frequencies of 0.05 and 0.01. Although the comparisons are made for a specific number of investigated individuals, the conclusions are fully general (Appendix 2).

Figure 2

The family configurations used when studying the effect of adding parents and grandparents.

Comparison II: adding children

The probability to detect a deletion for a fixed overall number of investigated individuals was calculated for SNPs. The probability to confirm a deletion was illustrated for a deletion frequency of 0.05 and a total of 120 individuals divided into 40 families with one child, 30 families with two children, 24 families with three children and 20 families with four children. The probability was also illustrated for a deletion frequency of 0.01 and a total of 420 individuals divided into 140 families with one child, 105 families with two children, 84 families with three children, 70 families with four children and 60 families with five children.

Results

Comparison I: adding parents and grandparents

The probabilities of confirming a deletion with allele frequency p0 in the different pedigree structures using multiallelic markers are as follows:

Duo (one parent and one child).

A deletion will be detected if the sampled parent has genotype AiA0 and the child inherits A0 from this parent and Aj (i≠j) from the unsampled parent. The probability of this event for one allele Ai is $2 p_i p_0 \cdot 0.5\,(1 - p_i - p_0)$. The total probability summed over all alleles is $\sum_i p_i p_0 (1 - p_i - p_0)$. (Note: here and in all expressions below summation over i means for i=1,…, m.) Since $\sum_i p_i = 1 - p_0$, this can be simplified to

$$P(C_{\mathrm{duo}}) = p_0\left[(1 - p_0)^2 - \sum_i p_i^2\right]$$

A trio (two parents and one child) gives the following expression (see Appendix 1 for the derivation)

The probabilities to detect a null allele in a trio with one or two grandparents are as follows:

for the tetra (trio with one grandparent) and

for the pento (trio with two grandparents). The double sums, $\sum_i \sum_{j \neq i}$, mean summation over i=1,…, m and j=1,…, m but skipping the cases when i=j.

For biallelic markers, such as SNPs, the expressions above reduce to

For biallelic markers, it is possible to show formally which family type is most efficient. With a fixed total sample size of k, the probability to detect a deletion in at least one family is $1-(1-P(C))^{N}$, where N=k/n and n is the number of individuals in a family. Comparing, for example, duos with trios then amounts to determining which of $1-(1-P(C_{\mathrm{duo}}))^{k/2}$ and $1-(1-P(C_{\mathrm{trio}}))^{k/3}$ is larger. The actual comparisons can be seen in Appendix 2.
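The duo versus trio comparison can be checked numerically by brute-force enumeration of genotypes and transmissions under the stated assumptions (random mating, no genotyping errors, inherited null alleles only). The sketch below is not the derivation of Appendix 2; the function names and the allele frequencies are illustrative assumptions. For a biallelic marker the enumeration gives 2·p0·p1·p2 for a duo and 4·p0·p1·p2 for a trio, values that follow from the model as coded here and should be checked against the published expressions.

```python
from itertools import product


def apparent_homozygote(genotype):
    """Allele for which the genotype looks homozygous (0 = null allele), else None."""
    visible = {a for a in genotype if a != 0}
    return visible.pop() if len(visible) == 1 else None


def confirms(parent, child):
    """Event C for one parent-child pair."""
    hp, hc = apparent_homozygote(parent), apparent_homozygote(child)
    return hp is not None and hc is not None and hp != hc


def p_confirm_duo(freqs):
    """P(C) for one typed parent and one child; freqs[0] is p0, freqs[1:] the marker alleles."""
    alleles = range(len(freqs))
    total = 0.0
    for a, b, u in product(alleles, repeat=3):   # typed parent (a, b); u comes from the untyped parent
        weight = freqs[a] * freqs[b] * freqs[u]
        for t in (a, b):                         # allele transmitted by the typed parent
            if confirms((a, b), (t, u)):
                total += 0.5 * weight
    return total


def p_confirm_trio(freqs):
    """P(C) for two typed parents and one child."""
    alleles = range(len(freqs))
    total = 0.0
    for f1, f2, m1, m2 in product(alleles, repeat=4):
        weight = freqs[f1] * freqs[f2] * freqs[m1] * freqs[m2]
        for tf, tm in product((f1, f2), (m1, m2)):
            child = (tf, tm)
            if confirms((f1, f2), child) or confirms((m1, m2), child):
                total += 0.25 * weight
    return total


def prob_at_least_one(p_c, n_families):
    return 1.0 - (1.0 - p_c) ** n_families


freqs = [0.05, 0.60, 0.35]   # p0, p1, p2 (illustrative values)
k = 120                      # fixed total number of genotyped individuals
duo, trio = p_confirm_duo(freqs), p_confirm_trio(freqs)
print(duo, trio)
print(prob_at_least_one(duo, k // 2), prob_at_least_one(trio, k // 3))
```

In a trio the two parents cannot both confirm the null allele through the same child, because that child would have to receive A0 from both parents and would then have no detectable genotype. This is why the trio value is exactly twice the duo value in this sketch.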

The results for SNPs show that for a fixed total number of individuals, a sample of pentos is always more efficient than a sample of the other family types, despite a lower total number of founder chromosomes in the sample. We define founder chromosomes as chromosomes that are not inherited from an individual in the pedigree. The second most efficient family type is the tetra, while the trio is better than the duo. This is illustrated in Figure 3a and b where the probability to detect a deletion is plotted for deletion frequencies of 0.01 and 0.05, respectively. Thus, every addition of one ancestor increases the probability, even if the improvement is small, as when adding grandparents to a trio.

Figure 3

The probabilities to detect a deletion in at least one family as a function of the minor allele frequency for a total sample size of 120 (corresponding to 60 duos, 40 trios, 30 tetras, or 24 pentos). (a) The frequency of the deletion is 0.01. (b) The frequency of the deletion is 0.05.

Some numerical examples of the results for multiallelic markers can be seen in Table 1. Rows 1–4 and 5–8 show 2–5 equifrequent alleles with a deletion frequency of 0.01 and 0.05, respectively. The probabilities of detecting a deletion are higher for multiallelic markers than for biallelic markers, and more alleles give higher probabilities. The probabilities for trio, tetra and pento are always very similar and substantially higher than for a duo. Which family configuration is best varies with the number of alleles and the allele frequencies. In general, trios tend to be best when there are many alleles, whereas pentos are best when one allele is very common.
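The effect of the number of alleles can be illustrated for the duo, whose per-allele term is spelled out in the text as 2·pi·p0·0.5·(1−pi−p0). The sketch below evaluates that sum for 2 to 5 equifrequent marker alleles and 60 duos. It covers the duo only and is not intended to reproduce Table 1; the function names are hypothetical.

```python
def p_confirm_duo(p0, marker_freqs):
    """Sum over marker alleles of 2*p_i*p0 * 0.5 * (1 - p_i - p0)."""
    return sum(2 * p_i * p0 * 0.5 * (1 - p_i - p0) for p_i in marker_freqs)


def equifrequent(p0, m):
    """m marker alleles sharing the non-deleted fraction 1 - p0 equally."""
    return [(1 - p0) / m] * m


n_duos = 60   # 120 individuals sampled as duos
for p0 in (0.01, 0.05):
    for m in range(2, 6):
        p_c = p_confirm_duo(p0, equifrequent(p0, m))
        print(p0, m, round(1 - (1 - p_c) ** n_duos, 4))
```

With equifrequent alleles the sum equals p0·(1−p0)²·(1−1/m), so it increases with the number of alleles m, in line with the trend described above.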

Table 1 Probabilities to detect deletions using multiallelic markers

Comparison II: adding children

For a nuclear family with b children, a deletion can be detected when at least one parent and at least one child appear to be homozygous for different alleles. The probability of this event is (see Appendix 1 for a derivation):

For a biallelic marker this reduces to

where b is the number of children. With an infinite number of children Equation (10) approaches the limit 4p1p2p0(3−p0). Note that the probability to confirm a null allele does not approach one when p0 approaches one. The reason for this is that individuals with no detectable genotype are discarded from the analysis.

The effect of varying the number of children in the expression above was investigated numerically. Adding the first extra child to a trio increases the efficiency per investigated individual, and adding a second does so for some allele frequencies, but adding further children makes the family structure less efficient (Figure 4a and b). The optimal size of a nuclear family is thus two or three children. The probability to detect at least one deletion is very similar for two and three children, with a higher probability for two children when allele frequencies are uneven and a higher probability for three children when allele frequencies are more even (Table 2).
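The behaviour described above can be explored without the closed-form expression by enumerating parental genotypes directly. The sketch below is a brute-force check under the stated assumptions, not the authors' Equation (10); `p_confirm_nuclear` is a hypothetical name and the allele frequencies are illustrative. It compares the value for many children with the limit 4p1p2p0(3−p0) quoted in the text and then evaluates the fixed-budget comparison for 120 individuals.

```python
from itertools import product


def apparent_homozygote(genotype):
    """Allele for which the genotype looks homozygous (0 = null allele), else None."""
    visible = {a for a in genotype if a != 0}
    return visible.pop() if len(visible) == 1 else None


def confirms(parent, child):
    hp, hc = apparent_homozygote(parent), apparent_homozygote(child)
    return hp is not None and hc is not None and hp != hc


def p_confirm_nuclear(freqs, n_children):
    """P(C) for two typed parents and n_children typed children (freqs[0] is p0)."""
    alleles = range(len(freqs))
    total = 0.0
    for f1, f2, m1, m2 in product(alleles, repeat=4):
        weight = freqs[f1] * freqs[f2] * freqs[m1] * freqs[m2]
        # q: chance that a single child forms a confirming pair with either parent
        q = sum(0.25 for tf, tm in product((f1, f2), (m1, m2))
                if confirms((f1, f2), (tf, tm)) or confirms((m1, m2), (tf, tm)))
        # children are independent given the parental genotypes
        total += weight * (1.0 - (1.0 - q) ** n_children)
    return total


p0, p1, p2 = 0.05, 0.475, 0.475          # illustrative frequencies
freqs = [p0, p1, p2]
limit = 4 * p1 * p2 * p0 * (3 - p0)      # limit quoted in the text
print(p_confirm_nuclear(freqs, 200), limit)

total_individuals = 120
for b in (1, 2, 3, 4):
    n_families = total_individuals // (b + 2)
    p_c = p_confirm_nuclear(freqs, b)
    print(b, n_families, round(1 - (1 - p_c) ** n_families, 4))
```

Changing `p1` and `p2` to more uneven values shows how the ranking of two versus three children depends on the marker allele frequencies.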

Figure 4

(a) The probabilities to detect a deletion in at least one family as a function of the minor allele frequency for a total sample size of 120 (corresponding to 40 families with one child, 30 families with two children, 24 families with three children, or 20 families with four children) when the frequency of the deletion is 0.05. (b) The probabilities to detect a deletion in at least one family as a function of the minor allele frequency for a total sample size of 420 (corresponding to 140 families with one child, 105 families with two children, 84 families with three children, 70 families with four children, or 60 families with five children) when the frequency of the deletion is 0.01.

Table 2 Probabilities to detect deletions using biallelic markers

Adding grandparents or children: which is better?

To answer this question, we compared the probabilities to detect a deletion for a biallelic marker for nuclear families with two or three children with a tetra and a pento, respectively. The probability for a biallelic marker in a single family is higher for a nuclear family with two children than for a tetra. The probability is thus also higher given a fixed sample size since the number of individuals in each family is the same. Comparing a nuclear family with three children with a pento in the same way gives the result that a nuclear family with three children is more efficient for detecting deletions than a pento (Table 2). The conclusion is thus that it is more efficient to add children than grandparents.
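The comparison between adding grandparents and adding children can also be checked by exact enumeration over an arbitrary small pedigree. The following sketch is an illustration under the stated assumptions and does not reproduce Table 2. The pedigree encoding, the name `p_confirm`, the allele frequencies, and the assumption that both grandparents in a pento are the parents of the same parent are all choices made here for illustration.

```python
from itertools import product


def apparent_homozygote(genotype):
    visible = {a for a in genotype if a != 0}   # allele 0 is the null allele
    return visible.pop() if len(visible) == 1 else None


def confirms(parent, child):
    hp, hc = apparent_homozygote(parent), apparent_homozygote(child)
    return hp is not None and hc is not None and hp != hc


def p_confirm(pedigree, typed, freqs):
    """Exact P(C) by enumeration.

    pedigree: dict mapping each individual to (father, mother), or None for founders;
              parents must be listed before their children.
    typed:    set of genotyped individuals; only typed parent-child pairs are checked.
    freqs:    allele frequencies, with freqs[0] the null allele.
    """
    names = list(pedigree)
    alleles = range(len(freqs))
    pairs = [(par, kid) for kid, parents in pedigree.items() if parents
             for par in parents if par in typed and kid in typed]

    def recurse(idx, genos, weight):
        if idx == len(names):
            hit = any(confirms(genos[p], genos[c]) for p, c in pairs)
            return weight if hit else 0.0
        name, parents = names[idx], pedigree[names[idx]]
        if parents is None:        # founder: two alleles drawn from the population
            return sum(recurse(idx + 1, {**genos, name: (a, b)}, weight * freqs[a] * freqs[b])
                       for a, b in product(alleles, repeat=2))
        father, mother = parents   # non-founder: one allele from each parent
        return sum(recurse(idx + 1, {**genos, name: (a, b)}, weight * 0.25)
                   for a in genos[father] for b in genos[mother])

    return recurse(0, {}, 1.0)


three_gen = {"G1": None, "G2": None, "P1": ("G1", "G2"), "P2": None, "C1": ("P1", "P2")}
nuclear2 = {"F": None, "M": None, "C1": ("F", "M"), "C2": ("F", "M")}
nuclear3 = {**nuclear2, "C3": ("F", "M")}
freqs = [0.05, 0.475, 0.475]   # p0, p1, p2 (illustrative values)

p_tetra = p_confirm(three_gen, {"G1", "P1", "P2", "C1"}, freqs)   # one grandparent typed
p_pento = p_confirm(three_gen, set(three_gen), freqs)             # both grandparents typed
p_nuc2 = p_confirm(nuclear2, set(nuclear2), freqs)
p_nuc3 = p_confirm(nuclear3, set(nuclear3), freqs)
print("tetra", p_tetra, "two children", p_nuc2)
print("pento", p_pento, "three children", p_nuc3)
```

Because a tetra and a two-child family both contain four individuals, and a pento and a three-child family both contain five, the per-family probabilities can be compared directly without adjusting for the number of families.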

Multiple loci

So far only single-marker analysis has been considered. Since the method under investigation is deductive rather than inferential, genuine multipoint analysis does not exist as such. The information from several loci that do not individually confirm the existence of null alleles cannot be combined into a confirmation in the sense above. If two adjacent markers confirm the presence of a null allele simultaneously, then it can be concluded that a deletion encompasses both markers. If more than one of the investigated markers is situated within the deleted region, there is a higher probability that at least one marker will confirm a null allele. For this reason, two aspects of the simultaneous analysis of two markers have been considered: the probability that at least one marker will confirm a null allele, P(C, ≥1), and the probability that both will confirm null alleles and thereby a deletion, P(C, 2). This is done for two biallelic markers in the duo and trio family structures, respectively.

For a duo, the probability of detection of a null allele at one locus or both is

And the probability of detection at both loci is

For a trio, the corresponding probabilities are

and

The derivation of Equations (11–14) is in Appendix 1. The ratio P(C, 2)/P(C, ≥1) can take any value between 0 and 1. One obvious result is that, if linkage disequilibrium (LD) is complete in the sense that either P11=P22=0 or P12=P21=0, then P(C, ≥1)=P(C, 2), ie detection always occurs simultaneously for both markers. If, in contrast, LD is zero and allele frequencies are equal, P(C, 2)/P(C, ≥1)=1/3. Under these conditions, however, P(C, ≥1) takes its maximum value. Since closely situated markers often are in LD and markers are selected for high minor allele frequencies, the probability of simultaneous detection will be relatively high.
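The two limiting cases mentioned above can be verified for a duo by enumerating haplotypes. This is a sketch under the assumptions of the text plus one additional assumption made here, namely that there is no recombination between the two closely situated loci within a family. The function name and the haplotype frequencies are illustrative.

```python
from itertools import product


def apparent_homozygote(a, b):
    visible = {x for x in (a, b) if x != 0}   # allele 0 is the deleted (null) state
    return visible.pop() if len(visible) == 1 else None


def two_locus_duo(hap_freqs):
    """Return (P(C, >=1), P(C, 2)) for a duo.

    hap_freqs maps (allele at A, allele at B) to its frequency; (0, 0) is the deletion.
    Haplotypes are transmitted intact (assumed absence of recombination).
    """
    haps = list(hap_freqs)
    p_any = p_both = 0.0
    for h1, h2, u in product(haps, repeat=3):   # typed parent (h1, h2); u from the untyped parent
        weight = hap_freqs[h1] * hap_freqs[h2] * hap_freqs[u]
        for t in (h1, h2):                      # haplotype transmitted by the typed parent
            hits = 0
            for locus in (0, 1):
                par = apparent_homozygote(h1[locus], h2[locus])
                kid = apparent_homozygote(t[locus], u[locus])
                if par is not None and kid is not None and par != kid:
                    hits += 1
            if hits >= 1:
                p_any += 0.5 * weight
            if hits == 2:
                p_both += 0.5 * weight
    return p_any, p_both


# Zero LD with equal allele frequencies: the four marker haplotypes are equally common.
no_ld = {(0, 0): 0.04, (1, 1): 0.24, (1, 2): 0.24, (2, 1): 0.24, (2, 2): 0.24}
# Complete LD in the sense P11 = P22 = 0.
full_ld = {(0, 0): 0.04, (1, 1): 0.0, (1, 2): 0.48, (2, 1): 0.48, (2, 2): 0.0}

any_no, both_no = two_locus_duo(no_ld)
any_full, both_full = two_locus_duo(full_ld)
print(both_no / any_no)        # ratio for zero LD and equal allele frequencies
print(any_full, both_full)     # equal under complete LD
```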

Effect of missing data

The calculations above are derived under the assumption that no data are missing. The effect of missing data varies among the different family types. If γ is the probability of missing one genotype in one individual, the probability of a complete duo is $(1-\gamma)^{2} \approx 1-2\gamma$ if γ is small. Thus, the probability of an incomplete duo is approximately 2γ. In either case, whether it is the parent or the child that is missing, the duo will be lost, leaving just a single individual. By analogy, the probability of losing an individual from a trio is approximately 3γ. However, if either of the parents is lost, a duo will remain. The family is completely lost only when the child is missing, which occurs with probability γ. For a tetra, a trio will remain with probability γ, a three-generation trio (ie grandparent, parent and child) with probability γ, and a duo with probability 2γ. For a pento, a tetra will remain with probability 3γ, a trio with probability γ and a duo with probability γ. Thus, when individuals are lost in the larger family types, the remaining individuals will still form an informative family, a fact that emphasizes the relative efficiency of larger families.
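The first-order approximations used above can be compared with the exact probabilities in a short sketch. The missing-genotype rate below is an arbitrary illustrative value, the names are hypothetical, and the table of remaining structures simply restates the counts given in the text.

```python
def p_complete(gamma, n_members):
    """Exact probability that all n_members genotypes are present."""
    return (1 - gamma) ** n_members


# What remains after losing exactly one individual, as described in the text.
# Each count is the number of family members whose loss produces that outcome,
# so the first-order probability of the outcome is count * gamma.
REMAINS_AFTER_ONE_LOSS = {
    "duo":   {"nothing informative": 2},
    "trio":  {"duo": 2, "nothing informative": 1},
    "tetra": {"trio": 1, "three-generation trio": 1, "duo": 2},
    "pento": {"tetra": 3, "trio": 1, "duo": 1},
}
SIZES = {"duo": 2, "trio": 3, "tetra": 4, "pento": 5}

gamma = 0.01   # illustrative missing-genotype rate
for family, size in SIZES.items():
    exact_loss = 1 - p_complete(gamma, size)
    first_order = size * gamma
    print(family, round(exact_loss, 5), first_order, REMAINS_AFTER_ONE_LOSS[family])
```

For small γ the exact and first-order values differ only in terms of order γ², which is why the text treats the probability of losing one individual from a family of n as nγ.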

Discussion

Every method to detect deletions has advantages and disadvantages. Because of this, physical and segregational methods have been combined to support each other in the search for deletions. For example, this was the strategy in the investigations by Conrad et al5 and McCarroll et al.10 One distinct advantage of using markers in families is that it directly allows for deduction of marker haplotypes outside the deletion, on the chromosome carrying the deletion. If more than one of the investigated individuals is found to carry a deletion within a certain region, the haplotype for the adjacent markers can be used to investigate whether the deletion has a single origin, or if the same region has been deleted several times. Moreover, if a deletion is found to have a single origin, the extended haplotype can be used to estimate the age of the deletion.

An important aspect is that, in many cases, the marker data are already available. These data might have been generated for purposes such as family-based association, linkage mapping or characterization of the pattern of LD in a region. If a set of markers has been typed in families, these data should be further analysed for possible deletions or other null alleles. Merely discarding data that do not fit Mendelian inheritance might lead to loss of important information. As an illustration, Amos27 showed that the pattern of missing values can carry important information: missing values for microsatellites had the same value for reconstructing a population tree as the data itself. This implied that, in the dataset he analysed, a large proportion of the missing values were due to null allele homozygotes. When a single marker indicates a null allele, it is not possible to distinguish between a deletion and other causes. This may be considered a problem if the goal is to identify deletions. However, this information is vital if inference methods are used to detect null alleles and estimate their frequencies.

The probability to detect a null allele for a single marker using segregation in pedigrees is higher than the probability of observing a departure from Hardy–Weinberg equilibrium using the same number of unrelated individuals (AM Johansson unpublished data). Therefore, family data are valuable for detecting deletions.

A rather strong assumption underlying the analysis is that of perfect genotyping. This is of course not the case in a real investigation. The limitation imposed by this assumption varies with the conditions. If the marker method is combined with a physical method as in McCarroll et al,10 it is a very limited problem. A deletion is then confirmed when the two methods both indicate a deletion. If several adjacent markers indicate a deletion, its existence can be inferred with great certainty without independent confirmation. The most critical situation is when a single marker in a single generational step indicates a null allele. Then, genotyping errors, deletions and other null alleles cannot be distinguished. This problem can be addressed in a number of ways. One is to introduce probabilities for genotyping errors. Such an approach is presented for trios by Amos et al.21 A problem with this method is that it relies on estimates of genotyping error rates. An alternative would be to re-type interesting SNPs both with the same set of primers and with newly designed primers. If the pattern is repeated with the original primers, a genotyping error can be excluded, whereas differentiation between deletions and other null alleles is not possible. If newly designed primers also indicate a null allele, it is almost certainly a deletion.

An important question when planning a study is whether it is more efficient to increase the size of the pedigrees by adding grandparents or by sampling more children. We showed that adding children to the pedigree is more efficient. This result is both surprising and encouraging, because it is much easier to collect information from additional children than from grandparents, who may no longer be living. However, for other purposes, it might be better to have a design with grandparents since they would provide additional founder chromosomes and also make inferences about haplotypes more accurate.

The result that larger families are more efficient in spite of the overall smaller number of founder chromosomes might seem surprising. However, it is only when a null allele is transmitted from a typed parent to a typed child that detection is possible. Thus, the relevant comparison between experimental designs is to count the total number of transmitted chromosomes that can be observed. In a duo, only one transmitted chromosome can be observed per family. If the total number of investigated individuals is nT, then a total of 0.5nT transmitted chromosomes can be observed. If trios are used, nT/3 families can be studied but they represent two transmitted chromosomes each. Thus (nT/3) × 2≈0.67nT transmitted chromosomes are observed in total. In a tetra, there are three transmissions but only two of them definitely represent founder chromosomes. The third is a founder chromosome with probability 0.5 and an already observed chromosome with probability 0.5. A retransmission is of course less informative than the transmission of a founder chromosome but is still useful, since it is possible to miss a deletion in the first transmission and detect it in the second. The relative value of a retransmission depends on the allele frequencies of the loci involved, but if we arbitrarily assign a value of 0.5, the expected number of transmissions within a tetra is 2+0.5 × 1+0.5 × 0.5=2.75. Under this assumption, the expected total number of observable transmitted chromosomes using tetras will be (nT/4) × 2.75≈0.69nT. Using the same assumption, a pento represents 3.5 transmitted chromosomes and as a consequence the total number will be (nT/5) × 3.5=0.70nT. These calculations explain both why larger families tend to be more efficient and why the largest jump in efficiency occurs between duos and trios.
If the same calculation is made for a nuclear family with two children, the expected number will be 2+0.25 × 2+0.5 × 1.5+0.25 × 1=3.5, yielding a total of (nT/4) × 3.5≈0.88nT chromosomes, which is compatible with the observations above that adding children is more efficient than adding parents.
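The bookkeeping in this argument is simple arithmetic and can be restated as a short sketch. The value 0.5 for a retransmission is the arbitrary choice made in the text, the pento value of 3.5 is taken directly from the text, and the variable and function names are hypothetical.

```python
RETRANSMISSION_VALUE = 0.5   # arbitrary relative value assigned in the text

# (family size, expected number of informative transmissions per family)
DESIGNS = {
    "duo": (2, 1.0),
    "trio": (3, 2.0),
    # two founder transmissions, plus a third that is new or a retransmission with equal chance
    "tetra": (4, 2 + 0.5 * 1 + 0.5 * RETRANSMISSION_VALUE),
    "pento": (5, 3.5),
    # second child: both chromosomes new (0.25), one new (0.5), or both retransmitted (0.25)
    "nuclear family, two children": (
        4,
        2 + 0.25 * 2 + 0.5 * (1 + RETRANSMISSION_VALUE) + 0.25 * (2 * RETRANSMISSION_VALUE),
    ),
}


def transmissions_per_individual(design):
    size, transmissions = DESIGNS[design]
    return transmissions / size


for design in DESIGNS:
    print(design, round(transmissions_per_individual(design), 4))
```

Multiplying each value by nT gives the totals quoted above. Changing `RETRANSMISSION_VALUE` shows how sensitive the values for the tetra and the two-child family are to that arbitrary choice.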