Introduction

This paper describes modifications to well-established maximum likelihood (ML) parentage-assignment methods to allow parentage to be assigned to pooled samples of multiple individuals—previously published ML methods are applicable to samples of single individuals only. This novel approach uses quantitative genotype data (also referred to as continuous genotype data; Clark et al. 2019) from low-density single nucleotide polymorphism (SNP) panels (Henshall et al. 2014). Quantitative genotype data are comprised of continuous numerical genotypes, reflecting the probabilities of all possible genotype classes (i.e., ordered allele combinations or unordered allele counts), in the form of genotype probability matrices or vectors. The use of quantitative genotype data for parentage assignment negates the need to call discrete genotype classes (Kalinowski et al. 2010; Marshall et al. 1998; Meagher and Thompson 1986). This is particularly important when assigning parentage to samples comprised of multiple individuals (i.e. ‘pools’), as it negates the need to call discrete genotype classes Marshall et al. 1998; Meagher and Thompson 1986). Genotype data from pools present like polyploid data and, as for polyploids, calling discrete genotypes is prone to error due to the large number of possible genotype classes at each SNP (Clark et al. 2019; Rahman et al. 2015).

Genetic markers are used for individual parentage assignment in pedigree-based selective breeding programmes: (i) where tracking identities over the life of individuals is difficult, expensive or not possible, a circumstance common in aquaculture species (Henshall et al. 2014; Kinghorn et al. 2010; Vandeputte and Haffray 2014); (ii) where only one parent is known with certainty, such as in the case of multiple-sire joining in livestock (Henderson 1988), poly cross families in trees (Burdon and Shelbourne 1971) and aggregated full-sib families in aquaculture (Hamilton et al. 2009); or (iii) for quality-control/identity-recovery purposes (Grattapaglia et al. 2014; Hansen and Kjaer 2006). Parentage assignment is also widely used in molecular ecology, including the study of conservation biology, dispersal and recruitment patterns, quantitative genetics and sexual selection (Flanagan and Jones 2019; Jones et al. 2010). A variety of genetic markers have been adopted for parentage assignment in selective breeding programmes and molecular ecology (Flanagan and Jones 2019; Jones et al. 2010). However, SNPs are increasingly used for this purpose, due to their capacity for high-speed high-throughput screening, low mutation rate, low genotyping error rate, high prevalence in the genome, and decreasing the cost of development and implementation (Hauser et al. 2011; Liu et al. 2016). A range of assays, platforms and algorithms are available for low-density SNP genotyping—most commonly using estimates of signal intensity (or area), for two axes, X and Y, corresponding to alleles A and B, from ‘intensity-based’ assays (Henshall et al. 2014; Rahman et al. 2015; Semagn et al. 2014) or counts of allele reads from genotyping-by-sequencing (GBS) assays (Clark et al. 2019), to call discrete genotypes.

Although declining, the cost of parentage assignment per individual precludes its application in many selective breeding programmes, and/or limits the circumstances or number of individuals to which it is applied. For example, further reduction in the cost of parentage assignment per individual would potentially allow: (i) pedigree-based selective breeding in additional species; (ii) an increase in the number of families and individuals able to be generated, retained and measured in current selective breeding programmes, with associated increases in the accuracy of selection; (iii) an increase in the number of environments across which progeny tests can feasibly be conducted; and/or (iv) the genetic analysis of additional traits (Bell et al. 2017; Henshall et al. 2014; Kinghorn et al. 2010; Vandeputte and Haffray 2014). Pooling samples from more than one individual, and using appropriate genetic markers and algorithms to determine parental contributions to pools, are means of further reducing the cost of parentage assignment in selective breeding programmes and molecular ecology studies (Henshall et al. 2014; Kinghorn et al. 2010).

Two distinct approaches to pool construction and parentage assignment are addressed in this paper: (i) ‘pooling-for-individual-parentage-assignment’, whereby each individual in a pool is tagged or traceable and from a known and mutually exclusive set of possible parents; and (ii) ‘pooling-by-phenotype’, whereby individuals of the same, or similar, phenotype/s are pooled (Bell et al. 2017; Henshall et al. 2012; Kinghorn et al. 2010) into bins (i.e. classes) of, ideally, equal width to minimise heterogeneity in measurement error. Furthermore, a novel ML parentage-assignment method using quantitative genotypes that can be used to accurately assign parentage to pools is described. This approach uses the same rationale underpinning ML parentage assignment using discrete genotypes (Kalinowski et al. 2010; Marshall et al. 1998)—in that it determines the likelihood of obtaining a certain pool (or individual offspring) genotype, given a set of parental genotypes—but uses quantitative genotypes in the place of discrete genotypes. ML parentage assignment using quantitative genotypes (herein referred to as ‘quantitative ML’) is validated, using simulated data, and compared with three alternative methods—ML parentage assignment using discrete genotypes (‘discrete ML’); exclusion (Chakraborty et al. 1974); and solving the weighted least squares problem (Henshall et al. 2014; Kinghorn et al. 2010).

Materials and methods

Computation of quantitative genotypes using intensity-based assays

For intensity-based assays, a method to compute quantitative genotypes, extended from Henshall et al. (2014), is outlined in Fig. 1, with a worked example provided in Supplementary Materials 1. Methods to compute quantitative genotypes (Fig. 1)—that is, the parent genotype probability matrices (Gij) and pool unordered genotype probability vectors (gkj)—using GBS are detailed elsewhere (Clark et al. 2019).

Fig. 1: Inputs and parameters required to compute quantitative genotypes (bold borders) from intensity-based assays.
figure 1

Dark grey boxes represent samples, light grey boxes represent data inputs from intensity-based assays and unshaded boxes represent scalars, vectors and matrices derived from data inputs. Note that a ‘pool’ may be comprised of one individual (dashed arrow).

For intensity-based SNP assays, the mean and standard deviation of allelic proportions for each genotype of each SNP (i.e., individual allelic proportion parameters—refer to Fig. 1 of Henshall et al. 2014—and pool allelic proportion parameters) are required to account for normally distributed random errors in allelic proportions in the computation of parent genotype probability matrices and pool unordered genotype probability vectors (Fig. 1). These are estimated from individual allele intensities and individual genotype calls from tissue and DNA samples of individuals (i.e., not pools). Individual samples (i.e. individual parent samples or non-parent individual samples; Fig. 1) should be sourced from the same, or a closely related, population to that for which parentage assignment is being undertaken, so as to maximise the probability that data for relevant loci and alleles are captured.

Estimation of individual allelic proportion parameters

Herein the approach and notation used to compute quantitative genotypes from intensity-based assays, accounting for normally distributed random errors in allelic proportions, is extended from Henshall et al. (2014). Accordingly, for a given individual and SNP, the individual allelic proportion (pij) takes a value between zero and one, corresponding to the polar coordinate range from 0 to \(\frac{\pi }{2}\) (Henshall et al. 2014):

$$p_{ij} = \left\{ {\begin{array}{*{20}{c}} {\frac{{{\mathrm{tan}}^{ - 1}\left( {\frac{{a_{2ij}}}{{a_{1ij}}}} \right)}}{{\frac{\pi }{2}}}\,if\,a_{1ij} \ge 0} \\ {2 + \frac{{{\mathrm{tan}}^{ - 1}\left( {\frac{{a_{2ij}}}{{a_{1ij}}}} \right)}}{{\frac{\pi }{2}}}\,if\,a_{1ij}\, < \,0} \end{array}} \right.$$
(1)

where pij is the individual allelic proportion (for details refer to the Appendix of Henshall et al. 2014), and a1ij and a2ij are the individual allelic intensities for allele A and B for individual sample i and SNP j (adjusted for ‘uncertainty’ where applicable—refer to Eq. (1) of Henshall et al. 2014). Conceptually, individual allelic intensities should not be less than zero and individual allelic proportions should range between 0 and 1 but minor deviations outside these parameter boundaries can be accommodated by Eq. (1). Note that software associated with some genotyping platforms directly output a variable that is conceptually equivalent to individual (or pool) allelic proportion (e.g., ‘B-allele frequency’, refer to Bell et al. 2017).

For samples from diploid individuals, discrete genotype classes can be reliably called using proprietary or other software appropriate to a given genotyping platform (Henshall et al. 2014). Individual allelic proportion parameters, means (\(\overline x _{AAj}\), \(\overline x _{ABj}\) and \(\overline x _{BBj}\)) and standard deviations (sAAj,sABj and sBBj) by genotype, can then be estimated for each SNP from a sample of individuals. These individual allelic proportion parameters are specific to the genotyping assay, platform and method used.

Estimation of pool allelic proportion parameters

Pool allelic proportion parameters for each SNP can be approximated from individual allelic proportion parameters, as follows. First, means of allelic proportions for homozygous A (\(\overline x _{jA}\)) and homozygous B (\(\overline x _{jB}\)) genotypes may be assumed to equal \(\overline x _{jAA}\) and \(\overline x _{jBB}\), respectively, and sjA and sjB to equal sjAA and sjBB, respectively. Second, mean and standard deviation of allelic proportion for the heterozygous genotypes with equal counts of A and B alleles may be assumed to equal \(\overline x _{jAB}\) and sjAB, respectively. Finally, means and standard deviations of allelic proportion for heterozygous genotypes with unequal counts of A and B alleles may be weighted according to allele counts. For example, for a pool size of three (i.e., three individuals in a pool): \(\overline x _{jAAAAAB} \approx \frac{2}{3}\overline x _{jAA} + \frac{1}{3}\overline x _{jAB}\) and \(s_{jAAAAAB} \approx \sqrt {\frac{2}{3}s_{jAA}^2 + \frac{1}{3}s_{jAB}^2}\), \(\overline x _{jAAAABB} \approx \frac{1}{3}\overline x _{jAA} + \frac{2}{3}\overline x _{jAB}\) and \(s_{jAAAABB} \approx \sqrt {\frac{1}{3}s_{jAA}^2 + \frac{2}{3}s_{jAB}^2}\), \(\overline x _{jAABBBB} \approx \frac{2}{3}\overline x _{jAB} + \frac{1}{3}\overline x _{jBB}\) and \(s_{jAABBBB} \approx \sqrt {\frac{2}{3}s_{jAB}^2 + \frac{1}{3}s_{jBB}^2}\); and \(\overline x _{jABBBBB} \approx \frac{1}{3}\overline x _{jAB} + \frac{2}{3}\overline x _{jBB}\) and \(s_{jABBBBB} \approx \sqrt {\frac{1}{3}s_{jAB}^2 + \frac{2}{3}s_{jBB}^2}\). Note that only parameters for unordered allele combinations (i.e., ‘unordered genotypes’—genotypes corresponding to allele counts) are required. For example, the unordered genotype AAAAAB corresponds to the ordered genotypes of AAAAAB, AAAABA, AAABAA, AABAAA, ABAAAA and BAAAAA.

Alternative approaches to estimating pool allelic proportion parameters can be conceived. For example, they could be estimated directly from pools with known SNP genotypes.

Computing quantitative genotypes

For a given parent and SNP, each element of the parent genotype probability matrix (Gij) represents the probability that the true genotype is equal to the corresponding ordered genotype. Adopting the approach of Henshall et al. (2014), for each parent i and SNP j a quantitative genotype probability matrix (Gij) can be computed:

$${\boldsymbol{G}}^{{\boldsymbol{ij}}} = \left[ {\begin{array}{*{20}{c}} {\lambda _{ij_{AA}}} & {\frac{{\lambda _{ij_{AB}}}}{2}} \\ {\frac{{\lambda _{ij_{BA}}}}{2}} & {\lambda _{ij_{BB}}} \end{array}} \right]/{\sum} {\lambda _{ij}}$$
(2)

where λijAA is the height of the N(\(\overline x _{j_{AA}},s_{j_{AA}}\)) distribution at x = pij, λijAB and λijBA are the height of the N(\(\overline x _{j_{AB}},s_{j_{AB}}\)) distribution at x = pij, λijBB is the height of the N(\(\overline x _{j_{BB}},s_{j_{BB}}\)) distribution at x = pij and Σλij is the sum of λijAA, λijAB and λijBB.

Extending the method of Henshall et al. (2014), for each pool and SNP, a pool unordered genotype probability vector (gkj)—elements of which represents the probability that the true genotype is equal to the corresponding unordered genotype—can be computed by modifying Eq. (2) to accommodate the additional genotypes possible in pools:

$${\boldsymbol{g}}^{{\boldsymbol{kj}}} = \left[ {\begin{array}{*{20}{c}} {\lambda _{kj_1}} \\ \vdots \\ {\lambda _{kj_{2n + 1}}} \end{array}} \right]\begin{array}{*{20}{c}} {\, \circ } \\ {\,} \\ {\,} \end{array}\left[ {\begin{array}{*{20}{c}} {\rho _1^{ - 1}} \\ \vdots \\ {\rho _{2n + 1}^{ - 1}} \end{array}} \right]/\mathop {\sum}\limits_{q = 1}^{2n + 1} {\lambda _{kjq}}$$
(3)

where gkj is a vector with 2n + 1 elements, corresponding to the number of possible unordered genotypes for pool k and SNP j, n is the pool size, and ° is the Hadamard or entrywise product. The elements λkj1 to λkj2n+1 are computed as the height at the corresponding N(\(\overline x _{j_1},s_{j_1}\)) to N(\(\overline x _{j_{2n + 1}},s_{j_{2n + 1}}\)) distributions at x = pkj (where pkj is the allelic proportion for pool k, computed according to Eq. (1), replacing i with k). The elements \({\uprho}_1^{ - 1}\) to \({\uprho}_{2n + 1}^{ - 1}\) are computed as the inverse of the number of possible ordered genotypes for the unordered genotypes 1 to 2n + 1. For example in the case of two individuals per pool (n = 2): λkj1 = λkjAAAA = height of the \({\mathrm{N}}\left( {\overline x _{j_{AAAA}},s_{j_{AAAA}}} \right)\) distribution at x = pkj and ρ1 = ρAAAA = 1 (the one ordered genotype is AAAA); λkj2 = λkjAAAB = height of the \({\mathrm{N}}\left( {\overline x _{j_{AAAB}},s_{j_{AAAB}}} \right)\) distribution at x = pkj and ρ2 = ρAAAB = 4 (the four ordered genotypes are AAAB, AABA, ABAA and BAAA); λkj3 = λkjAABB = height of the \({\mathrm{N}}\left( {\overline x _{j_{AABB}},s_{j_{AABB}}} \right)\) distribution at x = pkj and ρ3 = ρAABB = 6 (the six ordered genotypes are AABB, ABAB, BAAB, ABBA, BABA, BBAA); λkj4 = λkjABBB = height of the \({\mathrm{N}}\left( {x_{j_{ABBB}},s_{j_{ABBB}}} \right)\) distribution at x = pkj and ρ4 = ρABBB = 4 (the four ordered genotypes are ABBB, BABB, BBAB, BBBA); and λkj5 = λkjBBBB = height of the \({\mathrm{N}}\left( {\overline x _{j_{BBBB}},s_{j_{BBBB}}} \right)\) distribution at x = pkj and ρ5 = τBBBB = 1 (the one ordered genotype is BBBB). Note that in the case of one individual per pool (n = 1), the elements of a pool unordered genotype probability vector are equal to the elements of the upper triangle of a corresponding quantitative genotype probability matrix for an individual offspring (i.e. gkj is equivalent to the upper triangle of Goj from Henshall et al. 2014, expressed as vector). Where data for a SNP are missing in a pool, values of gkj can be assumed to equal the expected unordered transmission vector (nnj), where nnj equals tcj, assuming tij equals fj for all 2n parents (detailed below).

ML parentage assignment using quantitative genotypes

Inputs

Required inputs to undertake quantitative ML are the parent genotype probability matrices (Gij) and the pool unordered genotype probability vectors (gkj), described above; and a list of possible parental combinations in the pool, an assumed genotype replacement error rate (\(\widehat {\varepsilon _{nj}}\)) and a critical Δ log odds (LOD) value (Fig. 2; Supplementary Material 1). The list of possible parental combinations in the pool may comprise all combinations of all possible parents but in many circumstances a shorter list can be considered, substantially reducing the computational burden. For example, in the case of pooling-for-individual-parentage-assignment, pools are constructed such that the list of possible parental combinations in a pool must include two, and only two, parents from each known and mutually exclusive set of possible parents contributing to the pool. The assumed genotype replacement error rate can theoretically range between zero and one but is generally small (e.g., <0.01). Critical Δ LOD represents a pedigree assignment acceptance threshold corresponding to a desired level of confidence in assignment (i.e., a predetermined acceptable proportion of erroneous parentage assignments)—the greater the critical Δ LOD, the lower the erroneous assignment rate in accepted assignments, but the greater the number of falsely rejected assignments (Harrison et al. 2013; Jones et al. 2010). Although computationally burdensome, critical Δ LOD is most appropriately determined by simulating populations of parents and pools (Jones et al. 2010; Kalinowski et al. 2010; Marshall et al. 1998), given an assumed genotype replacement error rate and possible parental combinations in pools.

Fig. 2: Inputs and parameters required to undertake maximum likelihood parentage assignment using quantitative genotypes.
figure 2

Light grey boxes represent data inputs and unshaded boxes represent scalars, vectors and matrices derived from data inputs. Note that the expected allele frequency vector may be estimated from the parent transmission vector or other sources (dashed arrow).

Computation of the LOD score

Intermediate vectors required to compute the LOD score (Henshall et al. 2014; Meagher and Thompson 1986), and ultimately assign parentage, include the parent allele transmission vectors (tij), parent combination ordered transmission vectors (qcj) and parent combination unordered transition vectors (tcj; Fig. 2). For a given parent, i, and SNP, j, each element of the 2 × 1 parent allele transmission vector represents the probability that the corresponding allele (i.e., A or B) will be transmitted to its progeny. The parent allele transmission vector (tij) for parent i and SNP j can be computed as (Henshall et al. 2014):

$${\boldsymbol{t}}^{{\boldsymbol{ij}}} = {\boldsymbol{G}}^{{\boldsymbol{ij}}}\left[ {\begin{array}{*{20}{c}} 1 \\ 1 \end{array}} \right]$$
(4)

When marker data for a SNP from a parent is missing, the corresponding parent allele transmission vector may be assumed to equal the vector of expected allele frequencies transmitted to pools for SNP j (fj). The expected allele frequencies are most simply estimated by computing the average of parental tij vectors, excluding missing data, for SNP j (Fig. 2). However, less generic, but more precise, approaches to the computation of expected allele frequencies can be conceived. For example, if expected parental contributions to pools are known, expected allele frequencies could be computed separately for each pool as the average of parental tij vectors weighted by their expected parental contribution to the pool.

For each possible combination of parents, c, and SNP, j, elements of the 1 × 2n parent combination ordered transmission vector (qcj; Fig. 2) represent the probability that the corresponding ordered genotype will be transmitted to progeny. The parent combination ordered transmission vector can be computed as the Kronecker product, , of t1j,…, t2nj. That is:

$${\boldsymbol{q}}^{{\boldsymbol{cj}}} = \mathop{\otimes}\limits_{{\boldsymbol{i}} = 1}^{2{\boldsymbol{n}}}{\boldsymbol{t}}^{{\boldsymbol{ij}}}$$
(5)

where 2n is the number of parents for possible parent combination c.

It is noteworthy that the length of qcj increases exponentially by 2n (i.e., twice the pool size). However, qcj vectors can be consolidated into parent combination unordered transition vectors (tcj) of length 2n + 1, substantially reducing the computational burden. For example, if the pool size is three (n = 3), elements corresponding to the ordered genotypes AAAAAB, AAAABA, AAABAA, AABAAA, ABAAAA, and BAAAAA in qcj can be summed with the result becoming the element in tcj corresponding to the unordered genotype AAAAAB.

In the implementation of previously described ML parentage-assignment methods, genotyping errors are accounted for by assuming they represent the replacement of true genotypes with genotypes selected at random under Hardy–Weinberg assumptions in a small proportion of SNPs (i.e., the assumed genotype replacement error rate) (Henshall et al. 2014; Jones et al. 2010; Kalinowski et al. 2010; Marshall et al. 1998). Such errors can be similarly accounted for in quantitative ML by adjusting the parent combination unordered transition vectors (tcj) and pool unordered genotype probability vectors (gkj):

$${\boldsymbol{t}}_ \ast ^{{\boldsymbol{cj}}} = \left( {1 - \widehat {\varepsilon _{nj}}} \right){\boldsymbol{t}}^{{\boldsymbol{cj}}} + \widehat {\varepsilon _{nj}}{\boldsymbol{n}}^{{\boldsymbol{nj}}}$$
(6)
$${\boldsymbol{g}}_ \ast ^{{\boldsymbol{kj}}} = \left( {1 - \widehat {\varepsilon _{nj}}} \right){\boldsymbol{g}}^{{\boldsymbol{kj}}} + \widehat {\varepsilon _{nj}}{\boldsymbol{n}}^{{\boldsymbol{nj}}}$$
(7)

where \(\widehat {\varepsilon _{nj}}\) is the assumed genotype replacement error rate for pool size n and SNP j.

For each pool with genotype probability vector \({\boldsymbol{g}}_ \ast ^{{\boldsymbol{kj}}}\), the likelihood of each possible parent combination and pool duo for each SNP can be computed as:

$$L^{\left( {ck} \right)j} = {\boldsymbol{t}}_ \ast ^{{\boldsymbol{cj}}}\,{\boldsymbol{g}}_ \ast ^{{\boldsymbol{kj}}^{\prime}}$$
(8)

To compute the likelihood ratio \(\left( {\frac{{L^{\left( {ck} \right)j}}}{{L^{cj}}}} \right)\), the denominator is computed as the likelihood under the null hypothesis that the individuals in the pool are unrelated to the possible parent combination in question:

$$L^{cj} = {\boldsymbol{n}}^{{\boldsymbol{nj}}}{\boldsymbol{g}}_ \ast ^{{\boldsymbol{kj}}^{\prime}}$$
(9)

The likelihood ratio across all markers can then be computed as the sum of the natural log of the likelihood ratios, herein referred to as the LOD score:

$${\rm{LOD}}^{ck} = \mathop {\sum}\limits_{j = 1}^m {{\rm{log}}_e\left( {\frac{{L^{\left( {ck} \right)j}}}{{L^{cj}}}} \right)}$$
(10)

where m is the number of SNPs.

Parentage assignment

The most likely parental combination for each pool can be identified as that with the greatest LODck. The Δ LOD for each parent in the most likely parental combination can then be computed as the difference between the LODck for the most likely parental combination and the maximum LODck of those parental combinations that were identical to the most likely combination except for the parent in question. If a parent’s Δ LOD is greater than critical Δ LOD it can then be accepted as correctly assigned.

Validation

An R package (R Core Team 2020) entitled ‘SNPpools’, available at https://github.com/mghamilton/SNPpools, was developed to implement and validate, with simulations, the quantitative ML method. Simulations were conducted for pool sizes of one (i.e., samples of individuals), two and three using the sim.parent.assign.fun function of the SNPpools package.

Generation of simulated datasets for validation and comparison

Two pedigree structures (‘Pedigree 1’ and ‘Pedigree 2’) were simulated. Pedigree 1 was constructed assuming eight dams, each crossed with eight mutually exclusive sets of ten sires (i.e. 80 sires). This pedigree structure is akin to the case of multiple-sire joining in livestock (Henderson 1988), poly cross families in trees (Burdon and Shelbourne 1971) and aggregated full-sib families in aquaculture (Hamilton et al. 2009), where the one parent is known and the other is known to be one of a finite set of possible parents. Pedigree 2 was constructed assuming 40 sires and 40 dams each crossed with one individual of the opposite sex to produce 40 full-sib families. This pedigree structure represents a mating design applicable to species where half-sib families are difficult to generate, such as shrimp (Dai et al. 2020). In both pedigree structures, all parents were assumed to be unrelated (scenarios involving half the number of possible parents/families and related parents are explored in Supplementary Materials 2 and 3, respectively).

Pedigree 1 pools were constructed to allow ‘pooling-for-individual-parentage assignment’ and for Pedigree 2 pools were constructed assuming a ‘pooling-by-phenotype’ strategy. That is, for Pedigree 1, pools were constructed ensuring that the progeny of each dam was represented no more than once in each pool. For Pedigree 2, no constraints on the ancestry of individuals placed in pools were applied—that is, individuals from families were allocated to pools at random. For each pooling strategy and pool size, parentage was assigned to the progeny of 9600 ‘unknown parents’.

Generation of simulated datasets for validation and comparison

For each simulated sample (parent or pool) and 100 SNP, individual allele intensities were generated (scenarios involving 50 SNP and 200 SNP are explored in Supplementary Materials 4 and 5, respectively). First, ‘true’ genotypes were randomly generated for parents assuming that SNPs were biallelic, SNPs were in linkage equilibrium (i.e., inherited independently) and the allele frequency for all SNPs was 0.5 (a scenario in which the allele frequency for all SNPs was 0.4 is explored in Supplementary Materials 6). Second, true genotypes for progeny were then generated from true parental genotypes assuming alleles were inherited according to Mendelian principles. Finally, pool genotypes were generated by concatenating the true genotypes of individuals in the pools (e.g., a pool of two individuals with genotypes AB and BB had a pool genotype of ABBB).

To account for lab-based genotyping errors, each SNP for each individual and pool was randomly categorised as correct (with probability 1 − εj) or erroneous (with probability εj), where εj is the true genotype replacement error rate for SNP j (Henshall et al. 2014). If categorised as erroneous, a random ‘observed genotype’, which may differ from the ‘true genotype’, was generated. For the purpose of simulation the true genotype replacement error rate was assumed to equal 0.01—a somewhat large and thus conservative value—for all SNPs and samples.

For each parent and SNP, individual allelic proportions (pij) were generated as random normal variables:

$$p_{ij} \sim {\mathrm{N}}\left( {\mu _{jq},\,\sigma _{jq}} \right)$$
(11)

where μjq is the mean of the individual allelic proportion for the observed genotype of parent sample i, SNP j and genotype q; and σjq is the corresponding standard deviation. Assumed true inidividual allelic proportion parameters for all SNP were as follows: μAA = 0.05, μAB = 0.50, μAB = 0.05, μBB = 0.95 and σBB = 0.05 (scenarios in which standard deviations—σAA, σσAB and σBB—were assumed to equal 0.01 and 0.20 are explored in Supplementary Materials 7 and 8, respectively). Individual allele intensities were subsequently calculated as follows:

$$a_{1ij} = \sin \left( {p_{ij} \times \frac{\pi }{2}} \right)$$
(12)
$$a_{2ij} = \cos \left( {p_{ij} \times \frac{\pi }{2}} \right)$$
(13)

Pool allelic proportions (pkj) and pool allele intensities (a1kj and a2kj) where correspondingly computed.

For each simulated parent and pool, each SNP was randomly categorised as missing (with probability 0.1) or present (with probability 0.9). For those SNPs categorised as missing, allele intensities (i.e., a1ij, a2ij, a1kj and/or a2kj) were deleted (scenarios in which all SNP from some parents were categorised as missing are explored in Supplementary Materials 9).

Parentage assignment

For each simulated pool, quantitative ML was undertaken considering only possible parental combinations, and the assigned parents compared with the true parents of each pool. For comparison, parentage assignment was also undertaken for each pool by extending three previously described methods (Henshall et al. 2014). First, a ML parentage assignment using discrete genotypes approach was implemented. This method, ‘discrete ML’, was equivalent to quantitative ML, except that the quantitative parent genotype probability matrix (Gij) was replaced with a discrete genotype probability matrix, Dij, where \({\boldsymbol{D}}^{{\boldsymbol{ij}}} = \left[ {\begin{array}{*{20}{c}} 1 & 0 \\ 0 & 0 \end{array}} \right],\left[ {\begin{array}{*{20}{c}} 0 & {1/2} \\ {1/2} & 0 \end{array}} \right]\), or \(\left[ {\begin{array}{*{20}{c}} 0 & 0 \\ 0 & 1 \end{array}} \right]\). To compute Dij, if an element the corresponding Gij was greater than 0.98, for homozygous genotypes (i.e., diagonal elements corresponding to genotypes AA and BB), or 0.49 for heterozygous genotypes (i.e., off-diagonal elements corresponding to genotypes AB and BA), were replaced with 1 and 1/2, respectively, with all other elements equal to zero (Henshall et al. 2014). In addition, also using a threshold of 0.98, the pool unordered genotype probability vector gkj was replaced with dkj. Secondly, an exclusion method was applied. Using this method, the number of genotype mismatches, computed as the number of SNPs where \({\boldsymbol{t}}^{{\boldsymbol{cj}}}{\boldsymbol{d}}^{{\boldsymbol{kj}}^{\prime} } = 0\), for each pool and possible parental combination was calculated. The possible parental combination with the lowest proportion of mismatched SNPs (and lowest standard error of proportion, where multiple combinations with the same proportion were identified) was then identified. Finally, parental contributions to pools were estimated by solving the ‘weighted least squares problem’, as detailed in Henshall et al. (2014). The possible parental combination resulting in the minimum sum of squared difference from these estimated parental contributions was then assigned to the pool. Simple worked examples of the ‘quantitative ML’, ‘discrete ML’, ‘exclusion’ and ‘least squares’ methods of parentage assignment are provided in Supplementary Materials 1.

Results

Simulations revealed that not only can the quantitative ML method be used to assign parentage to pools, but the quantitative ML method was more accurate than discrete ML and exclusion under every scenario examined and was generally more accurate than the least squares method (Tables 1 and 2; Figs. 3 and 4; Supplementary Materials 10). This indicates that the quantitative ML method may be adopted to reduce the cost and or increase the accuracy of parentage assignment in selective breeding programmes and molecular ecology studies in many circumstances.

Table 1 Proportion of correct assignments to 9600 simulated unknown parents for pools of one, two and three individuals using a pooling-for-individual-parentage-assignment approach.
Table 2 Proportion of correct assignments to 9600 simulated unknown parents for pools of one, two and three individuals using a pooling-by-phenotype approach.
Fig. 3: Overlapping histograms of Δ LOD values for assignments to 9600 simulated unknown parents of pools.
figure 3

Histograms were generated using a pooling-for-individual-parentage-assignment approach—of one (a, b), two (c, d) and three (e, f) individuals. Results using quantitative maximum likelihood (a, c and e) and discrete maximum likelihood (b, d and f) parentage assignment are shown—correctly assigned individuals are in grey and incorrectly assigned individuals are represented as a black line. Where applicable, the critical Δ LOD value to achieve a 99% correct assignment rate is indicated by an arrow with the percentage of rejected assignments in parentheses.

Fig. 4: Overlapping histograms of Δ LOD values for assignments to 9600 simulated unknown parents of pools.
figure 4

Histograms were generated using a pooling-by-phenotype approach—of one (a, b), two (c, d) and three (e, f) individuals. Results using quantitative maximum likelihood (a, c and e) and discrete maximum likelihood (b, d and f) parentage assignment are shown—correctly assigned individuals are in grey and incorrectly assigned individuals are represented as a black line. Where applicable, the critical Δ LOD value to achieve a 99% correct assignment rate is indicated by an arrow with the percentage of rejected assignments in parentheses.

Simulations, adopting quantitative ML assigned parentage for pools of one (i.e., assignment of parentage to individuals) without error under both the pooling-for-individual-parentage assignment and pooling-phenotype scenarios (Tables 1 and 2; Figs. 3 and 4). For pools of two, correct assignment rates were very high, 0.997 and 1.000, respectively, and for pools of three, the correct assignment rate was reduced to 0.917 under pooling-for-individual-parentage assignment but remained high under the pooling-by-phenotype scenario (0.982). For pools of three, the correct assignment rate was notably poor for methods relying on discrete genotype assignments—0.175 and 0.106 for discrete ML, and 0.048 and 0.157 for exclusion. However, discrete ML and exclusion are highly dependent on the accuracy of discrete genotype calls and our results, although indicative that quantitative ML is superior to these methods, are only relevant to the approach to genotype calling adopted in simulations. Furthermore, using quantitative ML, it was necessary to reject a substantial percentage of assignments (31%; Fig. 3e) to achieve a 99% correct assignment rate (critical Δ LOD = 4.96) for pools of three using a pooling-for-individual-parentage-assignment approach. This highlights the inherent limitations of using low-density SNP panels to assign parentage to large pools.

Discussion

Quantitative ML can be used to assign parentage to pooled samples using low-density SNP data and simulations showed it to be more accurate in assigning parentage to pools than other approaches—discrete ML (Kalinowski et al. 2010; Marshall et al. 1998); exclusion (Chakraborty et al. 1974), and solving the weighted least squares problem (Henshall et al. 2014; Kinghorn et al. 2010). In addition, unlike exclusion and solving the weighted least squares problem, ML parentage assignment allows a desired level of confidence in assignment to be specified, by defining a critical Δ LOD value below which assignments are rejected (Figs. 3c, e and 4e).

Two circumstances where implementation of parentage assignment to pools is applicable were identified— pooling-for-individual-parentage-assignment and pooling-by-phenotype. In the case of pooling-for-individual-parentage assignment applied to selective breeding programmes, individuals in each pool are tagged or traceable and are from a known and mutually exclusive set of possible parents. This approach to pooling makes the construction of the additive relationship matrix for identifiable (e.g., tagged) individuals possible and is particularly suited to reconstructing full-sib pedigree from multiple-sire joinings in livestock (Henderson 1988), poly cross families in trees (Burdon and Shelbourne 1971) and aggregated full-sib families in aquaculture (Hamilton et al. 2009). However, it is also suitable for the pooling of samples from different rounds of selection or mating, different selective breeding populations (e.g., different hatcheries or seed orchards) or selective breeding programmes involving related species with common SNPs (Hamilton et al. 2019a; Hamilton et al. 2019b), where parents in each are mutually exclusive. Furthermore, it is conceivable that management of selective breeding populations could be altered to allow the adoption of pooling-for-individual-parentage-assignment. For example, in aquaculture selective breeding programmes where there are, in some circumstances, limited facilities to maintain families in separate tanks or hapas prior to tagging, families could be replicated across multiple tanks or hapas each containing multiple families in an orthogonal design so as to allow subsequent pooling-for-individual-parentage-assignment and the partitioning of common rearing environment and genetic effects in genetic analyses. In molecular ecology, pooling-for-individual-parentage assignment is applicable in circumstances where sets of mutually exclusive groups of parents can be identified (e.g., samples from different geographical areas, between which gene flow in a single generation is not possible).

In the second circumstance, pooling-by-phenotype, individuals are allocated to bins (i.e., classes) according to their phenotype, from which pools are then drawn. The primary limitation of pooling-by-phenotype is that genotypes are not assigned to tagged individuals making the re-identification of candidate parents difficult. However, pooling-by-phenotype does not preclude the adoption of a modified ‘walk-back selection’ approach (Sonesson 2005), in which individual with desirable phenotypes are tagged and individually assigned parentage. Furthermore, it can be adopted as a means of estimating additive co/variances in a cost-effective manner, to examine genotype-by-environment interaction and/or increase the accuracy of estimated breeding values for related individuals (Burdon 1977; Henderson and Quaas 1976). Estimating genetic additive co/variances using a pooling-by-phenotype approach, is most simply achieved by generating dummy identifiers for individuals in pools, allowing an additive relationship matrix to be computed using established methods (Henderson 1975; Henderson and Quaas 1976). A further drawback of pooling-by-phenotype in circumstances where multiple traits are measured is that multiple-trait bins must be generated, each containing only those individuals with phenotypes within a specified phenotypic range for each trait (Bell et al. 2017). This potentially requires the range of phenotypic values in any one pool (i.e., bin width) for any one trait to be large, resulting in less precise phenotypes.

Ability to assign parentage using ML, whether applied to pools or individuals (Jones et al. 2010), is a function of the number of possible parents (Supplementary Materials 2 and 3) and the extent to which their contributions to progeny are known; the extent to which parents are genotyped (Supplementary Materials 4), the degree of relatedness among individuals (Supplementary Materials 5); the number (Supplementary Materials 3 and 4), information content, linkage disequilibrium (LD) and neutrality of SNPs (Holman et al. 2017); and the extent of SNP expression and genotyping accuracy/error (Anderson and Garza 2006; Liu et al. 2016; Weinman et al. 2015) (Supplementary Materials 5 and 6). Furthermore, in the application of quantitative ML to pools, variation in DNA contributions among individuals in pools (e.g. due to unequal tissue contributions and/or differences in DNA amplification) reduces the accuracy of genotyping and parentage assignment (Barratt et al. 2002; Bell et al. 2017; Kinghorn et al. 2010). Variation in DNA contributions can be minimised by pooling samples after DNA extraction, rather than pooling tissue samples prior to DNA extraction. However, pooling DNA rather than tissue substantially increases DNA extraction costs and, if the benefits of parentage assignment to pools is to be fully realised, pools must be constructed in a logistically sensible and cost-effective fashion. Furthermore, before application to pools in a new population or circumstance, the quantitative ML should be validated by applying the method to pools of individuals with known pedigree using the SNPs, species, assay and platform to be applied.

Although the cost of pedigree assignment per individual can be reduced by increasing pool size it must be recognised that, in the application of quantitative ML, this increases both the percentage of individuals with Δ LOD values below the critical Δ LOD (and thus the number of rejected assignments) and the computational burden. Quantitative ML pools is computationally intensive for large pools, as the number of possible combinations of parents increases exponentially with pool size. It has been shown herein that the quantitative ML method can be practically implemented for pools of three—all simulations were conducted on a personal computer. However, access to high performance computing facilities is likely to be necessary if the method is to be applied to larger pool sizes.

Genotype data from pools present like polyploidy data. In both cases, calling of discrete genotypes is prone to error due to the large number of possible genotype classes at each SNP (Clark et al. 2019; Rahman et al. 2015), making accurate parentage assignment difficult using discrete ML and exclusion methods (Flanagan and Jones 2019; Spielmann et al. 2015; Wang and Scribner 2014). Accordingly, generalisation of the quantitative ML approach to polyploids, with differing modes of inheritance (Clark et al. 2019), would likely improve the accuracy of ML parentage assignment and merits further investigation—the use of quantitative genotypes has been adopted to increase the power of genomic prediction (de Bem Oliveira et al. 2019) and genome-wide association studies (GWAS; Grandke et al. 2016) in polyploid populations. Modification of quantitative ML and pooling to circumstances where parental genotypes are largely unknown—such as kinship analysis (Hamilton et al. 2019a; Hamilton et al. 2019b)—also warrants further development.

In conclusion, quantitative ML can be used to assign parentage to individuals and pools, to a desired level of confidence, using low-density SNP data. Moreover, parentage is assigned with greater accuracy using quantitative ML than by discrete ML (Kalinowski et al. 2010; Marshall et al. 1998); exclusion (Chakraborty et al. 1974), or solving the weighted least squares problem (Henshall et al. 2014; Kinghorn et al. 2010). The method is applicable to pools constructed using pooling-for-individual-parentage-assignment or pooling-by-phenotype approaches and has the potential to substantially reduce the cost of parentage assignment, even if applied to pools comprised of few individuals. However, before application in applied breeding programmes quantitative ML should be validated using pools of known pedigree and tissue/DNA contributions, using the SNPs, species, assay and platform to be applied; and the inherent limitations of using low-density SNP panels to assign parentage to large pools must be recognised. Generalisation of the quantitative ML approach to polyploids and kinship analysis applications warrants further investigation.