Mining whole genome sequence data to efficiently attribute individuals to source populations

Whole genome sequence (WGS) data could transform our ability to attribute individuals to source populations. However, methods that efficiently mine these data are yet to be developed. We present a minimal multilocus distance (MMD) method that rapidly handles these large data sets, together with methods for optimally selecting loci. We applied the method to WGS data to determine the source of human campylobacteriosis and the geographical origin of diverse biological species, including humans, and to proteomic data to classify breast cancer tumours. The MMD method provides highly accurate attribution and is computationally efficient for extended genotypes. These methods are generic, easy to implement for WGS and proteomic data, and have wide application.

The likelihood function $\mathcal{L}_{u,s} = \prod_{l=1}^{L} \pi_{u_l,l,s}$, used in many assignment tests [2-7], corresponds to the probability that $d_H(u, a_{i,s}) = 0$, i.e. the probability that the genotype $u$ exists in source $s$. Genetic distances used in distance-based assignment tests [4,5] can also be expressed in terms of the probabilities $\{\pi_{u_l,l,s}\}$. For example, Nei's $D_A$ distance [8] between the individual to be assigned and source $s$ is

$$D_A(u,s) = 1 - L^{-1} \sum_{l=1}^{L} \sqrt{\pi_{u_l,l,s}}.$$

We note that some classical genetic distances [8], such as Nei's standard genetic distance, $D_S$, or Nei's minimum genetic distance, $D_m$, depend on the gene identity [9] of the sources, $J_s = L^{-1} \sum_{l=1}^{L} \sum_{a \in A} \pi_{a,l,s}^{2}$, in addition to the probabilities $\{\pi_{u_l,l,s}\}$. For example, Nei's standard genetic distance between $u$ and source $s$ is

$$D_S(u,s) = -\ln\left( \frac{L^{-1} \sum_{l=1}^{L} \pi_{u_l,l,s}}{\sqrt{J_s}} \right).$$

The gene identity is intrinsic to each source and does not reflect the similarity between the individual to be attributed and the sources. In general, methods based on $D_S$ and $D_m$ will predict a higher attribution to the source with lower gene identity, but this preference is unrelated to the individual to be attributed.
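The effect of the gene identity on $D_S$ can be illustrated with a small numerical sketch. It assumes haploid genotypes, treats the individual as having gene identity 1, and uses the forms $D_A = 1 - L^{-1}\sum_l \sqrt{\pi_{u_l,l,s}}$ and $D_S = -\ln\big(L^{-1}\sum_l \pi_{u_l,l,s} / \sqrt{J_s}\big)$. The function names and the two toy sources are hypothetical and are not taken from the data analysed in the paper.

```python
import math

def likelihood(pi_u):
    """Product over loci of the source probability of the individual's allele."""
    return math.prod(pi_u)

def nei_da(pi_u):
    """D_A between a single haploid genotype and a source (assumed form)."""
    return 1.0 - sum(math.sqrt(p) for p in pi_u) / len(pi_u)

def gene_identity(source):
    """Mean over loci of the sum of squared allele probabilities."""
    return sum(sum(p * p for p in locus) for locus in source) / len(source)

def nei_ds(pi_u, j_s):
    """D_S between a single haploid genotype (gene identity 1) and a source (assumed form)."""
    j_us = sum(pi_u) / len(pi_u)
    return -math.log(j_us / math.sqrt(j_s))

# Hypothetical sources: allele probabilities per locus.
# Allele 0 is the allele carried by the individual at every locus.
source_a = [[0.4, 0.6], [0.4, 0.6]]            # less diverse: higher gene identity
source_b = [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3]]  # more diverse: lower gene identity

results = {}
for name, source in (("a", source_a), ("b", source_b)):
    pi_u = [locus[0] for locus in source]
    j_s = gene_identity(source)
    results[name] = {
        "likelihood": likelihood(pi_u),
        "D_A": nei_da(pi_u),
        "J_s": j_s,
        "D_S": nei_ds(pi_u, j_s),
    }

for name, values in results.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

Both toy sources assign the same probability to the individual's alleles, so the likelihood and $D_A$ coincide, whereas $D_S$ is smaller for the source with the lower gene identity.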

II. ATTRIBUTION ERRORS ASSOCIATED WITH ERRORS IN ALLELE PROBABILITIES
As mentioned in the main text, errors in the estimates of the allele probabilities $\{\pi_{a,l,s}\}$ used to characterise sources will induce an error in attribution. Here we estimate the dependence of the attribution error on the number $L$ of loci in the genotypes and the number $I_s$ of genotypes used to describe each source.

A. Attribution error for the MMD method
For the MMD method, errors in the estimates of the allele probabilities propagate to the quantile $\lambda_{u,s}(q)$, score $\sigma_{u,s}$ and attribution probability $p_{u,s}$ defined in the Methods of the main text. The dependence of the errors of $\lambda_{u,s}(q)$ and $\sigma_{u,s}$ on $L$ and $I_s$ can be estimated for a simple model for unlinked loci in which alleles have the same probability distribution for all loci, i.e. a model with $\pi_{u_l,l,s} = r_s$ independently of $l$. In this case, the Hamming distance obeys a binomial distribution for $L$ Bernoulli trials with probability of success $1 - r_s$. In the limit of large $L$, the binomial distribution can be approximated by a normal distribution with mean $\mu_s = L(1 - r_s)$ and variance $\Delta_s^2 = L r_s (1 - r_s)$. Under these assumptions, the quantile $\lambda_{u,s}(q)$ satisfies

$$\lambda_{u,s}(q) = \mu_s + \Delta_s \Phi^{-1}(q). \qquad (2)$$

The score $\sigma_{u,s}$ quantifying the proximity of genotype $u$ to source $s$, Eq. (3), then depends on $r_s$ and on $\lambda_{\min}$. Here, $\lambda_{\min} = \min_s \{\lambda_{u,s}(q)\}$ and $\Phi^{-1}(x)$ is the inverse of the cumulative distribution function for the standard normal distribution. From Eq. (2), the error of $\lambda_{u,s}(q)$ in the limit of extended genotypes with large $L$ is given by

$$\delta\lambda_{u,s} = \frac{\partial \lambda_{u,s}}{\partial r_s}\,\delta r_s \sim L\,\delta r_s,$$

since the derivative of $\mu_s$ with respect to $r_s$ has magnitude $L$, which dominates the $O(L^{1/2})$ contribution from $\Delta_s$.
Here, $\delta r_s$ is the error in the allele probabilities. In the MMD method and other methods that approximate these probabilities by the observed allele frequencies, the error is $\delta r_s = O(I_s^{-1/2})$. Therefore, $\delta\lambda_{u,s} \sim L I_s^{-1/2}$.
Since $\lambda_{u,s} \sim L$ (cf. Eq. (2)), we conclude that the relative error of $\lambda_{u,s}$ is $\delta\lambda_{u,s}/\lambda_{u,s} = O(I_s^{-1/2})$, i.e. it does not increase with the number of loci, $L$.
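The independence of the relative quantile error from $L$ can be checked numerically. The sketch below assumes the simplified model in which all loci share the same probability $r_s$, and the normal-approximation quantile $\lambda = L(1 - r_s) + \sqrt{L r_s (1 - r_s)}\,\Phi^{-1}(q)$. The function names and the values $q = 0.01$, $r_s = 0.7$ and $I_s = 400$ are illustrative assumptions.

```python
from statistics import NormalDist

def quantile(L, r, q):
    """Normal approximation to the q-quantile of a Binomial(L, 1 - r) Hamming distance."""
    mean = L * (1.0 - r)
    sd = (L * r * (1.0 - r)) ** 0.5
    return mean + sd * NormalDist().inv_cdf(q)

def relative_error(L, r, delta_r, q=0.01):
    """Relative change of the quantile when r is misestimated by delta_r."""
    exact = quantile(L, r, q)
    perturbed = quantile(L, r + delta_r, q)
    return abs(perturbed - exact) / exact

r = 0.7                  # assumed common allele probability
I_s = 400                # assumed number of genotypes describing the source
delta_r = I_s ** -0.5    # error of a frequency estimate, O(I_s^{-1/2})

# Relative error of the quantile for genotypes of increasing length.
rel = {L: relative_error(L, r, delta_r) for L in (10**3, 10**4, 10**5, 10**6)}
for L, value in rel.items():
    print(L, round(value, 4))
```

Under these assumptions the relative error stays close to $\delta r_s / (1 - r_s)$ for every $L$, while the absolute error grows in proportion to $L$.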
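Let us denote the closest source to individual $u$ as $s_{\text{closest}}$ (this is the source with $\lambda_{u,s_{\text{closest}}} = \lambda_{\min}$). From Eq. (3), the error in the assignment score $\sigma_{u,s}$ is given by

$$\delta\sigma_{u,s} = \frac{\partial \sigma_{u,s}}{\partial r_s}\,\delta r_s + \frac{\partial \sigma_{u,s}}{\partial \lambda_{\min}}\,\delta\lambda_{\min} \simeq \begin{cases} a L^{1/2} e^{-bL^{2}}\,\delta r_s, & \text{for } s \neq s_{\text{closest}}, \\ a' L^{1/2}\,\delta r_s, & \text{for } s = s_{\text{closest}}. \end{cases}$$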
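Here, $a$, $a'$ and $b$ are independent of $L$ and we have assumed that $\delta r_s$ is approximately the same for all sources, including $s_{\text{closest}}$. One can show that the error for the attribution probability $p_{u,s}$ is proportional to $\delta\sigma_{u,s}$. To summarise, our arguments show that the assignment error for the MMD method is $O(L^{1/2})$. In the particular case in which the allele probabilities are estimated by frequencies, one has $\delta r_s = O(I_s^{-1/2})$, so the assignment error is $O(L^{1/2} I_s^{-1/2})$.

For methods based on the likelihood function $\mathcal{L}_{u,s} = \prod_{l=1}^{L} \pi_{u_l,l,s}$, errors $\{\delta\pi_{u_l,l,s}\}$ in the probability estimates induce an error

$$\delta\mathcal{L}_{u,s} = \sum_{l=1}^{L} \frac{\partial \mathcal{L}_{u,s}}{\partial \pi_{u_l,l,s}}\,\delta\pi_{u_l,l,s} = \mathcal{L}_{u,s} \sum_{l=1}^{L} \frac{\delta\pi_{u_l,l,s}}{\pi_{u_l,l,s}} \sim L\,\mathcal{L}_{u,s}, \qquad (7)$$

where we have assumed $\delta\pi_{u_l,l,s} > 0$ for all loci. According to Eq. (7), the relative error of the likelihood function, $\delta\mathcal{L}_{u,s}/\mathcal{L}_{u,s}$, increases with $L$ unless the errors in the probability estimates, $\{\delta\pi_{u_l,l,s}\}$, are zero.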
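The log-likelihood function, $\ln \mathcal{L}_{u,s}$, is more commonly used than the likelihood $\mathcal{L}_{u,s}$ itself. Since $\delta \ln \mathcal{L}_{u,s} = \delta\mathcal{L}_{u,s}/\mathcal{L}_{u,s}$ to first order, the error of the log-likelihood function typically equals the relative error of $\mathcal{L}_{u,s}$ and is therefore $O(L)$. Attribution errors for methods based on a likelihood function thus increase faster with $L$ than those for the MMD method, for which the error is $O(L^{1/2})$.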
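The linear growth of the likelihood error with $L$ can be reproduced with a short calculation. The sketch below assumes, for illustration only, that every locus has the same true probability `pi` and that each estimate is too large by the same positive amount `delta_pi`. The function names and numerical values are hypothetical.

```python
import math

def log_likelihood_error(L, pi, delta_pi):
    """Error of the log-likelihood when each of L probabilities pi is overestimated by delta_pi."""
    return L * math.log1p(delta_pi / pi)

def likelihood_relative_error(L, pi, delta_pi):
    """Relative error of the likelihood, ((pi + delta_pi) / pi)**L - 1, computed stably."""
    return math.expm1(log_likelihood_error(L, pi, delta_pi))

pi, delta_pi = 0.7, 1e-4   # assumed per-locus probability and estimation error

# Errors for genotypes of increasing length.
errors = {
    L: (log_likelihood_error(L, pi, delta_pi), likelihood_relative_error(L, pi, delta_pi))
    for L in (10, 100, 1000)
}
for L, (log_err, rel_err) in errors.items():
    print(L, log_err, rel_err)
```

In this toy setting the log-likelihood error is exactly proportional to $L$, and the relative error of the likelihood is close to $L\,\delta\pi/\pi$ as long as that product remains small.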