Introduction

Molecular markers are increasingly being used to decide which of a group of animals or plants could possibly be a parent of a particular individual animal or plant and which can definitely be excluded as a parent (Ellstrand, 1984; Tammisola et al., 1994). Although the methods are equally applicable to animal populations, we shall, for ease of presentation, assume that we are dealing with a plant species which is outbreeding and diploid. In this paper we shall assume that only the male parent needs identification. We may or may not have information about the marker alleles of the female parent. A future paper will discuss the identification of both parents. Often the parentage of several different plants will be of interest and compromises will then be needed because, as we shall see, the best way to pool the DNA of the potential parents depends on the genotype of the progeny plant.

Assuming no errors in the typing of the molecular markers and no germ-line mutations, a plant can be excluded as a possible parent if it possesses neither of the alleles of the progeny plant. Knowing the marker alleles of the female parent may provide further evidence leading to exclusion. All the plants to be tested will be referred to as potential parents, and all the plants that cannot be excluded on the basis of their marker alleles will be referred to as possible parents. We shall assume throughout that one of the potential parents is the real parent. This can clearly only be a reasonable assumption in controlled environments or with closed, isolated populations.

Determining the molecular markers at a locus for a large number of potential parents can be time-consuming and expensive. If there are many potential parents and the probability of exclusion on the basis of the marker alleles at the locus is high, then a preliminary screen based on pooling the DNA of groups of plants may substantially reduce the number of tests required. If the pooled DNA does not contain the required allele or alleles, then all the parents contributing to the pool can be excluded. If the pooled DNA does contain the required allele or alleles then we know that the pool contains at least one possible parent. We shall assume that a pool being positive, i.e. containing the required allele or alleles, does not provide any information about the number of possible parents in the pool but simply that there is at least one.

The pooling of the DNA is almost equivalent to the use of group screening for defective items described originally by Dorfman (1943) and Sterrett (1957), with more recent work by Balding et al. (1994) and Bruno et al. (1995) on the screening of clone libraries for rare ‘positives’. Our problem differs in that we know that there is at least one possible parent, the real parent, among the potential parents.

By pooling the DNA, we hope to reduce the number of tests required to identify all the possible parents. Until the penultimate section, only two stages of testing will be allowed and so the parents in pools not excluded by the first stage of testing will be tested individually. More complicated schemes will often be too costly in time and organization. We shall attempt to minimize, by choice of the size of the pool, the expected number of tests required. However, cost may depend on the number of tests possible on each electrophoretic gel as well as on the total number of tests. This may be a consideration in choosing between similar schemes.

In the next section, we derive an expression for the expected number of tests required using designs (allocations of potential parents to pools) in which each parent is allocated to a single pool. The optimal pool size will be derived in terms of the number of potential parents and the probability, q, that a potential parent other than the real parent does not have the required allele or alleles to be a possible parent. Then, the exclusion probability, q, will be derived as a function of the genotype of the progeny and the frequencies of the marker alleles in the population. When the progeny plant is heterozygous at the marker locus and the marker alleles of the female parent are not known, the expected number of tests has to be averaged over the probabilities of the marker type of the female parent. Following this, the single-replicate designs will be compared with designs including each parent in two, three or four different pools. Finally, the use of sequences of molecular markers will be considered.

Derivation of the expected number of tests

If the number of potential parents, N, can be factorized as N=nk, then we can form n pools of k parents each. The probability that a pool will contain at least one possible, i.e. nonexcludable, parent will be \(1-q^{k}\), where q is the probability that an individual parent can be excluded. All parents in such a pool will be tested individually and so, recalling that all the parents in the pool containing the real parent are bound to be tested, the expected number of tests is N if k=1, and \(N/k+k+(N-k)(1-q^{k})\) if k⩾2. The expected number of tests per potential parent is therefore E=1 if k=1, and:

\[E=\frac{1}{k}+\frac{k}{N}+\left(1-\frac{k}{N}\right)\left(1-q^{k}\right)\quad\text{if } k\geq 2. \tag{1}\]

The values of E for these single-replicate designs for a range of values of N and q are given in Table 1. The possible larger pool sizes omitted from the table were never optimal. Clearly, for these values of N, the testing of each individual parent (k=1) is to be preferred if the probability of exclusion, q, is less than 0.5. When q approaches 1, the expected number of tests per potential parent approaches

\[E=\frac{1}{k}+\frac{k}{N}=\frac{n+k}{N},\]

which is minimized by n=k=√N. Therefore, if q is likely to be much above 0.5, a reasonable choice for n and k is n=k=√N. There will be substantial savings in the number of tests if q is near 1, with the proportional saving increasing with the number of potential parents, N.
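The calculation above can be sketched numerically. The following is a minimal illustration, assuming the per-parent form obtained by dividing N/k+k+(N−k)(1−q^k) by N; the function names `expected_tests_per_parent` and `best_pool_size` are hypothetical and the search over divisors of N is one simple way to pick k, not necessarily the procedure used for Table 1.

```python
def expected_tests_per_parent(N, k, q):
    """Expected tests per potential parent for a single-replicate design.

    N: number of potential parents, k: pool size (must divide N),
    q: probability that a parent other than the real one is excluded.
    """
    if N % k != 0:
        raise ValueError("k must divide N so that N = n * k")
    if k == 1:
        return 1.0  # no pooling: every parent is tested once
    # n = N/k pool tests, k tests in the real parent's pool, and
    # (N - k) further tests weighted by the chance the pool is positive.
    return 1.0 / k + k / N + (1.0 - k / N) * (1.0 - q ** k)


def best_pool_size(N, q):
    """Pool size among the divisors of N that minimizes the expectation."""
    divisors = [k for k in range(1, N + 1) if N % k == 0]
    return min(divisors, key=lambda k: expected_tests_per_parent(N, k, q))
```

For example, `best_pool_size(64, 1.0)` returns 8, the square root of N, while for a low exclusion probability such as q=0.3 the search falls back to individual testing (k=1).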

Table 1 Expected number of tests per potential parent required in single-replicate designs for a range of values of number of potential parents, N, pool size, k, number of pools, n, and probability of exclusion, q

Calculating the probability of exclusion

The probability, q, that a particular plant can be excluded on the basis of an individual test depends on whether the progeny plant is homozygous or heterozygous at the marker locus, and also on the frequencies of the marker alleles in the population. Write f1, f2,..., fm, with

\[\sum_{i=1}^{m} f_i = 1,\]

for the frequencies of the m marker alleles M1, M2,..., Mm in the population.

Consider first a homozygous progeny with marker genotype M1M1. The only potential parents that can be excluded by their genotype at the marker locus are those with no M1 allele and so \(q=(1-f_1)^{2}\), whatever the marker genotype of the female parent. Equation 1 and Table 1 can therefore be used with \(q=(1-f_1)^{2}\).

Consider now a heterozygous progeny, M1M2. All the other alleles, M3, M4,..., Mm, can be classified as a single allele, \(\overline{M}\), with frequency \(\overline{f}\)=f3+f4+...+fm=1−f1−f2. The marker genotype of the female parent, which may be known, must be M1M1, M1M2, M2M2, M1\(\overline{M}\) or M2\(\overline{M}\). Bayes' Theorem can be applied to the genotypes of the female parent and the progeny plant to obtain the probability of the genotype of the female parent given the genotype of the progeny plant as:

Assuming random mating with respect to the marker locus, the probabilities of the five genotypes for the female parent are shown in the second column of Table 2. The third column shows the male parent genotypes that would be excluded if we knew the female genotype and the final column the probabilities of these genotypes in the population.

Table 2 Probability calculations for a heterozygous progeny, M1M2

If the marker genotype of the female parent is known, then the last column of Table 2 can be used to provide the values of q for (1) and hence for Table 1.
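The Bayes calculation for a heterozygous M1M2 progeny can be sketched as follows. This is an illustrative derivation, not a copy of Table 2: it assumes random mating with Hardy-Weinberg genotype proportions for the female parent and a male gamete drawn at random from the population, and the function name and dictionary keys are hypothetical. For each female genotype it returns the posterior probability and the exclusion probability q for a male that lacks the allele(s) the male parent must have supplied.

```python
def female_posteriors_and_exclusion(f1, f2):
    """Posterior female genotype probabilities and q for an M1M2 progeny."""
    fbar = 1.0 - f1 - f2  # pooled frequency of all other alleles
    # genotype: (prior, P(progeny M1M2 | female genotype), exclusion q)
    cases = {
        "M1M1":   (f1 * f1,       f2,              (1.0 - f2) ** 2),
        "M1M2":   (2 * f1 * f2,   (f1 + f2) / 2.0, fbar ** 2),
        "M2M2":   (f2 * f2,       f1,              (1.0 - f1) ** 2),
        "M1Mbar": (2 * f1 * fbar, f2 / 2.0,        (1.0 - f2) ** 2),
        "M2Mbar": (2 * f2 * fbar, f1 / 2.0,        (1.0 - f1) ** 2),
    }
    # Bayes' Theorem: posterior is prior times likelihood, normalized.
    total = sum(prior * like for prior, like, _ in cases.values())
    return {
        genotype: (prior * like / total, q)
        for genotype, (prior, like, q) in cases.items()
    }
```

Under these assumptions the posteriors simplify to f1/2, (f1+f2)/2, f2/2, \(\overline{f}\)/2 and \(\overline{f}\)/2 for the five genotypes in the order listed in the code.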

If the marker genotype of the female parent is not known, then the expected number of tests (eqn 1) must be averaged over the distribution of q shown in Table 2. So (1) becomes:

where

Table 3 shows the expected number of tests for N=36, 64 and 144 when the pool sizes are k=½√N and k=√N for some possible sets of values for the marker allele frequencies f1, f2 and \(\overline{f}\). The only gains from pooling, E<1, occur when both of the alleles of the progeny plant are rare, f1=f2=0.05. In this case, a pool size of k=½√N is slightly more efficient than k=√N.

Table 3 Expected number of tests per potential parent for a heterozygous progeny plant for varying number of potential parents N, pool sizes k, and marker allele frequencies f1, f2, \(\overline{f}\). Single-replicate with k=½√N and k=√N
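The averaging over the unknown female genotype can be illustrated with a short calculation. This is a sketch under stated assumptions rather than a reproduction of Table 3: the weights are the posterior female genotype probabilities obtained under random mating (f1/2, (f1+f2)/2, f2/2, \(\overline{f}\)/2, \(\overline{f}\)/2), the single-replicate expectation 1/k+k/N+(1−k/N)(1−q^k) is averaged over them, and the function name is hypothetical.

```python
def averaged_tests_per_parent(N, k, f1, f2):
    """Expected tests per potential parent, female genotype unknown."""
    if k == 1:
        return 1.0  # individual testing needs one test per parent
    fbar = 1.0 - f1 - f2
    # (posterior weight of the female genotypes, exclusion probability q)
    cases = [
        ((f1 + fbar) / 2.0, (1.0 - f2) ** 2),  # male must supply M2
        ((f2 + fbar) / 2.0, (1.0 - f1) ** 2),  # male must supply M1
        ((f1 + f2) / 2.0, fbar ** 2),          # male may supply M1 or M2
    ]
    # The expectation is linear in q**k, so only its mean is needed.
    mean_qk = sum(weight * q ** k for weight, q in cases)
    return 1.0 / k + k / N + (1.0 - k / N) * (1.0 - mean_qk)
```

With N=64 and f1=f2=0.05 this gives a value below 1 for both k=4 and k=8, with k=4 the smaller, whereas with f1=f2=0.2 the value for k=4 exceeds 1, in line with the pattern described in the text.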

Similar calculations are possible when more than one progeny plant is available but the results will then depend on the individual genotypes of all the progeny.

Pools with parents replicated

So far, only pool designs with each potential parent in one and only one pool have been considered. The calculation of the expected number of tests, E, for designs with replicates, is complicated by the need to allow for the possibility that in a nonexcludable pool all but one of the parents can be excluded on the basis of information from other pools. The probability of this occurring depends on high-order association patterns of parents in pools and can only be expressed algebraically if there is so much replication that there will almost certainly be more testing than with individual testing of the potential parents.

Ignoring this complication, we shall assume that each parent is replicated equally often and does not occur in the same pool with any other parent more than once. As before, N is the number of potential parents, k the pool size, and q the probability of exclusion of a parent on the basis of an individual test. The number of replicates of each parent will be r, so that the number of pools is:

\[n=\frac{rN}{k}.\]

The real parent and all other nonexcludable parents will require further individual testing and the expected number of these further tests is:

Any of the excludable parents in a pool containing the real parent will be further tested if all of the other (r−1) pools containing it include at least one possible parent. This has probability:

and contributes an expected number of further tests:

Similarly, the expected number of further tests for excludable parents not in a pool with the real parent, is:

Adding eqns 3, 4 and 5 and the first-stage testing of the n pools and dividing by N, gives the expected number of tests per potential parent as:

Setting r=1 reproduces, as it must, formula 1. As before, the expected number of tests (eqn 6) will need to be averaged over the distribution of q in Table 2 if the progeny plant is heterozygous and the marker genotype of the female plant is not known.
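The counting argument in this section can be sketched in code. What follows is one reconstruction consistent with the verbal description, not necessarily the exact published form of eqns 3 to 6: it assumes n=rN/k pools, that the r(k−1) parents sharing a pool with the real parent are each excludable with probability q, and that a pool of k−1 other parents contains at least one possible parent with probability 1−q^(k−1). The function name is hypothetical, and the sketch is intended for k⩾2.

```python
def replicated_tests_per_parent(N, k, r, q):
    """Approximate expected tests per potential parent with r replicates."""
    n_pools = r * N / k  # first-stage pool tests
    # Real parent plus the other nonexcludable parents are always retested.
    possible = 1.0 + (N - 1) * (1.0 - q)
    # Excludable parents sharing one pool with the real parent are retested
    # if their remaining r - 1 pools are all positive.
    sharing = r * (k - 1)
    positive_pool = 1.0 - q ** (k - 1)
    with_real = sharing * q * positive_pool ** (r - 1)
    # Other excludable parents are retested if all r of their pools are positive.
    without_real = (N - 1 - sharing) * q * positive_pool ** r
    return (n_pools + possible + with_real + without_real) / N
```

Setting r=1 in this sketch gives 1/k+k/N+(1−k/N)(1−q^k), the single-replicate result, which is the check the text requires of the combined formula.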

Designs do not exist for all combinations of N, r and k. The simplest two-replicate designs (r=2) occur when N is a perfect square. As a simple example, when N=9 we can write the number of each potential parent in a square array:

1 2 3
4 5 6
7 8 9

and form six pools of three parents each by using the rows and columns of the square. Three-replicate designs can similarly be formed if N is a perfect cube. These designs with \(k=N^{1/2}\) and \(k=N^{1/3}\) are special cases of lattice designs (Cochran & Cox, 1957) and larger numbers of replicates can be obtained. In Table 4 all these designs are listed for N=36, 64 and 144 with r≤4. With r>4, the number of pools is greater than \(4N^{1/2}\) or \(4N^{2/3}\) and little would be gained over individual testing. Table 4 includes the one other design available for N=36 which has r=2, k=8 (Bose et al., 1954).
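The row-and-column construction for a perfect-square N can be sketched as below. This is only an illustration of the two-replicate case; parents are assumed to be numbered 1 to N row by row, and the helper names are hypothetical. The second function checks the condition used in the text that no two parents occur together in more than one pool.

```python
from itertools import combinations


def square_lattice_pools(side):
    """Two-replicate pools (rows, then columns) for N = side * side parents."""
    grid = [[row * side + col + 1 for col in range(side)] for row in range(side)]
    row_pools = [list(row) for row in grid]
    col_pools = [[grid[row][col] for row in range(side)] for col in range(side)]
    return row_pools + col_pools


def max_shared_pools(pools):
    """Largest number of pools in which any pair of parents occurs together."""
    parents = sorted({p for pool in pools for p in pool})
    return max(
        sum(1 for pool in pools if a in pool and b in pool)
        for a, b in combinations(parents, 2)
    )
```

For N=9, `square_lattice_pools(3)` yields the three row pools and the three column pools, each parent appears in exactly two pools, and `max_shared_pools` returns 1.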

Table 4 Approximate expected number of tests per potential parent for designs with replication compared with the ‘best’ single-replicate designs and designs with no pooling

Table 4 shows the expected number of tests per potential parent for the various designs when the progeny plant is homozygous, \(q=(1-f_1)^{2}\), or is heterozygous and the female parent's marker genotype is known so that q is one of the values in the final column of Table 2. Table 4 shows, as expected, that replication is only worthwhile compared with single-replicate designs in terms of reducing the number of further tests required if the probability of individual exclusion, q, is large, for example greater than 0.8. Even then the savings are only appreciable, about 20 per cent, with the largest value of N studied, when three- or four-replicate designs are best. The designs with replicates may have an advantage in that they may provide some check on the occurrence of errors in classifying pools as excludable or not.

Table 5 gives a few examples of the expected number of tests when the progeny plant is heterozygous and the marker type of the female parent is unknown. Pooling is still preferable to no pooling, E<1, only when both alleles in the heterozygous progeny are rare, f1=f2=0.05. Additional replication does not reduce the number of tests per parent by a worthwhile amount.

Table 5 Expected number of tests per potential parent for a heterozygous progeny plant for varying number of potential parents, N, pool sizes, k, and marker allele frequencies f1, f2, \(\overline{f}\). Female parent marker alleles unknown. Varying number of replicates, r, with k=√N

Some checks can be made on the importance of ignoring the possibility of identifying a possible parent because all other parents in a pool containing it have been excluded on the basis of evidence from other pools. If the only possible parent is the real parent, then, for a single-replicate design, the expected number of tests per potential parent is:

and for a two-replicate design (k=√N, n=2√N), the true parent will be identified without further testing and so:

If there is just one possible parent in addition to the real parent, then, with a single-replicate design, the probability that the two possible parents are in the same pool is (k−1)/(N−1) leading to:

With a two-replicate design four individual tests are needed if the two possible parents are never in the same pool, leading to:

Equations (7) and (9) show that, at least for N≤100, E is generally minimized for single-replicate designs by taking n=k=√N. The only exception is when there is an additional possible parent and N=36; then n=9, k=4 is slightly preferable to n=k=6, with E=16.7 compared with E=17.1. With k=√N, single- and two-replicate designs have the same E-value when the only possible parent is the real parent.

Table 6 shows the values of E from eqns (7), (9) and (10) for a range of values of N with k=√N. The two-replicate designs are more efficient than the single-replicate designs when there is a possible parent in addition to the real parent. The advantage increases to 20 per cent when N=100.

Table 6 Expected number of tests per potential parent, E, one- and two-replicate designs with k=√N
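The special cases discussed in this section can be checked with a small calculation. The sketch below is a reconstruction from the verbal description, with hypothetical function names, and it counts the total number of tests (pool tests plus individual tests) rather than tests per parent; on that scale it reproduces the quoted values 16.7 and 17.1 for N=36 and the 20 per cent advantage at N=100. It assumes that, in a two-replicate square design, two possible parents sharing a row or column are both identified without further testing.

```python
from math import isqrt


def single_rep_only_real(N, k):
    """Total tests, single replicate, real parent is the only possible parent."""
    return N / k + k


def two_rep_only_real(N):
    """Total tests, two-replicate square design, real parent only."""
    return 2 * isqrt(N)


def single_rep_one_extra(N, k):
    """Total tests, single replicate, one extra possible parent."""
    p_same_pool = (k - 1) / (N - 1)
    # A second pool of k parents is retested unless both share one pool.
    return N / k + k + k * (1.0 - p_same_pool)


def two_rep_one_extra(N):
    """Total tests, two-replicate square design, one extra possible parent."""
    side = isqrt(N)
    p_share = 2 * (side - 1) / (N - 1)  # same row or same column
    # Four row-column intersections need testing when no pool is shared.
    return 2 * side + 4 * (1.0 - p_share)
```

For N=100 and k=10 the ratio `two_rep_one_extra(100) / single_rep_one_extra(100, 10)` is 0.8.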

Using several molecular markers

An alternative to replication is to test those pools not excluded using one molecular marker by using a second molecular marker, and those pools not excluded by the second marker by a test using a third marker, and so on. We shall assume a single-replicate design for the N potential parents with n pools of k plants each, N=nk. We shall assume that there are no correlations in the occurrence of particular alleles at different molecular marker loci. As potential parents are eliminated, the optimal pool size k will change. The resulting changes in the number of tests required would probably not be sufficient to compensate for the extra work involved in reconstructing the pools. The pools will therefore be kept intact.

Writing Pi as the probability that a particular pool not containing the real parent would be excluded by the ith molecular marker, the probability that the pool would be excluded by at least one of the first l markers is (Feller, 1968):

The expected number of tests of the pool using up to M molecular markers will be:

If all individuals in the pools not excluded by the M markers are tested individually using all M markers, the expected total number of tests for the (n−1) pools not containing the true parent is:

The pool containing the true parent will require (M+kM) tests. Thus the expected total number of tests per potential parent is:

With no pooling, k=1 and the expected total number of tests per potential parent is:

The expected number of possible parents at the conclusion of the testing is:

where qi is the probability of a potential parent being excluded on the basis of alleles at the marker i locus.
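The sequential scheme described above can be sketched as follows. This is an illustrative reconstruction from the text, with hypothetical function names, and it makes two assumptions that should be checked against the original equations: a pool is tested with marker l+1 only if the first l markers failed to exclude it, and, with no pooling, each parent other than the real one is tested marker by marker until excluded while the real parent is tested with all M markers. The inputs `pool_exclusion` (the Pi) and `q_list` (the qi) are supplied by the caller.

```python
from math import prod


def pooled_tests_per_parent(N, k, pool_exclusion):
    """Expected tests per potential parent with n = N/k intact pools."""
    n = N // k
    M = len(pool_exclusion)
    still_positive = 1.0  # chance a pool has survived the markers so far
    pool_tests = 0.0
    for P in pool_exclusion:
        pool_tests += still_positive  # this marker is applied to the pool
        still_positive *= 1.0 - P
    # Surviving pools have all k members tested with all M markers.
    individual_tests = k * M * still_positive
    total = (n - 1) * (pool_tests + individual_tests) + (M + k * M)
    return total / N


def unpooled_tests_per_parent(N, q_list):
    """Expected tests per potential parent when every plant is tested singly."""
    still_possible = 1.0
    tests = 0.0
    for q in q_list:
        tests += still_possible
        still_possible *= 1.0 - q
    return ((N - 1) * tests + len(q_list)) / N


def expected_possible_parents(N, q_list):
    """Expected number of possible parents left after all markers."""
    return 1.0 + (N - 1) * prod(1.0 - q for q in q_list)
```

With a single marker and Pi=q^k, `pooled_tests_per_parent` reduces to the single-replicate expectation 1/k+k/N+(1−k/N)(1−q^k), and `expected_possible_parents` does not depend on k, as noted later in the text.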

If the ith marker locus in the progeny plant is homozygous, Pi in (11) is:

\[P_i=\left(1-f_{i1}\right)^{2k},\]

where fi1 and, later, fi2 and \(\overline{f}\)i now refer to the frequencies of the marker alleles M1, M2 and \(\overline{M}\) at the ith marker locus. If the ith marker locus in the progeny plant is heterozygous:

where as in eqn (2):

if the marker genotype of the female parent is unknown. The value of qi can be taken from the last column of Table 2 if the maternal marker type is known.

Table 7 shows the performance of a three-locus system with 64 potential parents for a variety of frequencies for the alleles of the progeny plant; for example, 0.05 represents a homozygote with allele frequency 0.05 and 0.05/0.2 a heterozygote with allele frequencies 0.05 and 0.2. The genotypes of the maternal parent are assumed not known. The order of testing and k, the size of the pool, have been chosen to minimize the expected number of tests. The fifth column shows the expected number of tests per parent, the sixth column the expected number of tests with no pooling, k=1, and the seventh column the expected number of possible parents after the testing. Table 8 shows comparable values for single-locus and two-locus testing.

Table 7 Performance of three-marker loci. Test loci in optimal order and pool size optimal, N=64
Table 8 Performance of one- and two-marker loci. Loci in optimal order and pool size optimal, N=64

Table 7 shows that pooling has considerable advantages in terms of number of tests required unless the allele frequencies are relatively high. The overall, and unsurprising, advantage of rare alleles is clear in terms of both the expected number of tests and the expected number of possible parents remaining after the tests. The latter is independent of the number of parents in each pool.

Table 8 shows the same general features noted for Table 7. Comparing the results of Tables 7 and 8, the two-locus tests require more tests per parent than a single-locus test but give a considerable reduction in the number of possible parents remaining at the end of the tests. The same is true when three-locus tests are used in place of two-locus tests. The loci for which the progeny plant is homozygous and the loci for which the progeny plant has the rarest allele should be tested before the other loci. The optimal size of the pool for a single locus is four for rare alleles and one for the commoner alleles. For two loci, the optimal pool size can also be two for intermediate or mixed-allele frequencies. The optimal size of pool with three loci is often four, and only one when all the alleles possessed by the progeny plant are relatively common.

Discussion

There are clear advantages in pooling the DNA of the potential parents if the number of such parents is large and the alleles found in the progeny are rare. A good rule, whether the parent of a single progeny plant or animal or the parents of several different progeny are sought, is to choose a pool size close to ½√N, where N is the number of potential parents. There are considerable advantages in the sequential use of different markers in terms of reducing the number of possible parents remaining at the conclusion of the tests. Unless the number of potential parents is very large, there is little advantage in including each potential parent in more than one pool.