Joint DNA-based disaster victim identification

We address computational and statistical aspects of DNA-based identification of victims in the aftermath of disasters. Current methods and software for such identification typically consider each victim individually, leading to suboptimal power of identification and potential inconsistencies in the statistical summary of the evidence. We resolve these problems by performing joint identification of all victims, using the complete genetic data set. Individual identification probabilities, conditional on all available information, are derived from the joint solution in the form of posterior pairing probabilities. A closed formula is obtained for the a priori number of possible joint solutions to a given DVI problem. This number increases quickly with the number of victims and missing persons, posing computational challenges for brute force approaches. We address this complexity with a preparatory sequential step aiming to reduce the search space. The examples show that realistic cases are handled efficiently. User-friendly implementations of all methods are provided in the R package dvir, freely available on all platforms.

www.nature.com/scientificreports/

Another problem with restricted methods is that they may produce inconsistent results. For example, since conclusions are reached independently for each victim (or family), it may happen that a victim is classified as being the most likely member of two different families.
Our goal has been to present methods and implementations that provide consistent solutions to DVI problems, by considering all available data simultaneously. This is achieved by Algorithm 3 in the Methods section, which finds the most likely solution among all possible, while keeping the need for brute force calculations to a minimum.
Decision makers and the legal system often require independent conclusions for each missing person. In response to this, we provide formulas for posterior pairing probabilities for each victim–missing person pair.
To the best of our knowledge, no freely available software offers joint DVI computations. The restricted strategies mentioned above are implemented in Familias 11, but currently allow only one missing person in each family. Commercial programs like Bonaparte 12,13 and DNA View 3 provide similar functionality, but precise details of the implementations are not publicly available.
To rectify this we have developed the R package dvir, based on the ped suite ecosystem for pedigree analysis in R 14. The data sets analysed in this paper are included as part of dvir, and further examples are given in the documentation. The source code is freely available from https://github.com/thoree/dvir.

Methods
The starting point of our investigations is a DVI situation involving s victim samples, hypothesised to belong to some or all of m missing persons (MPs). Identification is done by genetic matching against relatives of the missing persons, using a battery of forensic markers.
In our examples we assume all markers to be independent autosomal markers in Hardy-Weinberg equilibrium (HWE). However, it should be noted that the overall approach applies very generally.

Figure 1. A toy DVI problem. The PM data consist of 3 victim samples to be matched against 3 missing persons (red) belonging to two different families. The AM data contain profiles from the reference individuals R1 and R2 (blue), one from each family. Squares and circles represent males and females, respectively. The hatched individuals are typed with a single marker, with genotypes as shown.

We proceed to describe the input data in a bit more detail, and introduce some important notation.

PM data
The post mortem (PM) samples are denoted V1, ..., Vs. We assume throughout that these belong to different individuals. In practice, samples gathered from disaster victims often contain duplicates, requiring a preprocessing step in order to identify and merge these 15.
In our examples we also assume that the sex of each Vi is known, with sF females and sM males, so that sF + sM = s. We note that this assumption is not vital to our methods, but helps to narrow the search space.

AM data
The ante mortem (AM) data consist of one or more reference families, each containing at least one missing person (denoted M1, M2, ..., Mm) and at least one genotyped reference member (denoted R1, R2, ...). Again we assume known sex of all family members. In particular, let mF and mM be the number of female and male missing persons, respectively, with mF + mM = m.
A possible solution to the DVI problem, referred to as an assignment, is a one-to-one correspondence between a subset of V = {V1, ..., Vs} and a subset of M = {M1, ..., Mm}, with the requirement that all identifications are sex consistent. For example, in Fig. 1, a consistent assignment is {V1 = M2, V3 = M3}. Alternatively, we may write this more compactly as a tuple (M2, *, M3), whose i-th element is the match for Vi, or '*' if the assignment does not include a match for Vi. In the case of Fig. 1 there are in total 14 assignments, as listed in the first three columns of Table 1.
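To make the tuple notation concrete, the enumeration of sex-consistent assignments can be sketched in a few lines. This is an illustrative Python sketch, not the dvir implementation; the sexes assigned below to the toy individuals are assumptions chosen to be compatible with the example (they reproduce the count of 14 assignments and the sex-consistent pairings discussed in the text).

```python
from itertools import combinations, permutations

def assignments(victims, missing, sex):
    """Enumerate all sex-consistent assignments as tuples, one entry per
    victim: the matched missing person, or '*' if the victim is unmatched."""
    n = len(victims)
    results = set()
    for k in range(min(n, len(missing)) + 1):
        for vs in combinations(range(n), k):       # which victims are matched
            for ms in permutations(missing, k):    # which MPs, in which order
                if all(sex[victims[i]] == sex[m] for i, m in zip(vs, ms)):
                    a = ['*'] * n
                    for i, m in zip(vs, ms):
                        a[i] = m
                    results.add(tuple(a))
    return sorted(results)

# Toy problem of Fig. 1; the sexes below are assumed for illustration.
sex = {'V1': 'M', 'V2': 'M', 'V3': 'F', 'M1': 'M', 'M2': 'M', 'M3': 'F'}
A = assignments(['V1', 'V2', 'V3'], ['M1', 'M2', 'M3'], sex)
print(len(A))   # 14 assignments, matching Table 1
```

Note that the empty assignment ('*', '*', '*') is always included, in line with the remark below about the null model.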
Note that the empty assignment ( * , * , * ) is a valid solution, referred to as the null model below.
The likelihood L(a) of an assignment a is defined as the probability

L(a) = P(PM and AM data | a, θ),    (1)

where the fixed parameters θ include the reference pedigrees, marker allele frequencies and mutation models. To simplify the notation we write L0 for the likelihood of the empty assignment, i.e., corresponding to the hypothesis that all victims are unrelated to all the missing persons. Moreover, we define LRi,j = L(Vi = Mj)/L0 to be the likelihood ratio of the assignment {Vi = Mj}, giving rise to the pairwise LR matrix

B = (LRi,j).    (2)

It should be noted that the likelihoods appearing in the definition of LRi,j involve the complete PM and AM datasets. However, simpler calculations are obtained by considering the reduced DVI problem (PMi, AMj), where PMi is just Vi, and AMj consists of data from the relatives of Mj. Then it is straightforward to show that LRi,j coincides with the likelihood ratio obtained from this reduced problem.

In the simple case shown in Fig. 1, the matrix B can be computed by hand. Let us assume that the marker has 10 alleles 1, 2, ..., 10, with equal frequencies p1 = ... = p10 = 1/10. For example, the element LR2,2 is the LR when V2 = M2 is tested against the hypothesis that V2 and M2 are unrelated. This gives LR = p1/(2 p1 p2) = 5. Obviously, convincing LRs cannot be expected in this case, with only a single marker.
The zero elements of B correspond to sex-inconsistent pairings or exclusions. Furthermore, we see that the DNA data is uninformative for some of the pairings. The entries LR 1,1 = LR 2,1 = 1 result from the fact that M 1 is not related to either of the reference individuals, and imply that he can never be identified unless M 2 is identified first.
The number of assignments. Let A be the set of all sex-consistent assignments for a given DVI problem.
The total number of elements, n = |A|, is a good measure of the problem's size, and may indicate whether a brute force approach is feasible. Consider first the situation where sex is known neither for the victims nor for the MPs. The total number of assignments is then

n = Σ_{k=0}^{min(s,m)} C(s, k) · C(m, k) · k!,    (3)

where C(·, ·) denotes the binomial coefficient. The reasoning is as follows: for each k, there are C(s, k) different subsets of k victims. Each of these can be assigned to C(m, k) different subsets of the m missing persons. Finally, each assignment can be shuffled in k! ways.
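As a check on this reasoning, formula (3) and its sex-separated version can be implemented in a few lines. This is an illustrative Python sketch, independent of dvir; it reproduces the counts 19081 (Table 2) and 9847 (Case study 2) quoted elsewhere in the paper.

```python
from math import comb, factorial

def n_assignments(s, m):
    """Formula (3): the number of assignments with s victims and
    m missing persons, when sex is known for neither."""
    return sum(comb(s, k) * comb(m, k) * factorial(k)
               for k in range(min(s, m) + 1))

def n_assignments_sexed(sF, sM, mF, mM):
    """With known sexes, formula (3) applies to each sex separately."""
    return n_assignments(sF, mF) * n_assignments(sM, mM)

print(n_assignments_sexed(8, 0, 5, 0))   # 19081: 8 victims, 5 MPs, all one sex
print(n_assignments_sexed(3, 2, 6, 6))   # 9847: the setting of Case study 2
```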
When the sexes are known, formula (3) applies to females and males independently, and the total number becomes

n = nF · nM,    (4)

where nF and nM are obtained by applying formula (3) to the female and male parts of the problem, respectively. Unsurprisingly, n increases rapidly with the number of victims and missing persons, but depends strongly on the distribution of sexes. To illustrate, Table 2 tabulates the number of assignments with 8 victims and 5 MPs, for all combinations of males/females. The total of 19081 assignments when all victims and MPs have the same sex is considerably higher than in all other cases.

Sequential approaches. Here we describe two natural implementations of the PM-driven search strategy.
As alluded to in the introduction this sequential approach is suboptimal in several ways, but it may be the best option in very large-scale applications. The motivation for including these algorithms here is to expose and clarify implementational details, and to serve as reference for the novel methods described later.
Algorithm 1.
Output: A proposed solution to the DVI case in the form of an assignment a. (In case of ties, more than one assignment may result.)
Procedure:
(i) Compute the pairwise LR matrix B.
(ii) If all elements of B are below T, then stop. Otherwise, let LRi,j be the maximal element of B and store the identification Vi = Mj. If there are multiple maximal elements, branch off and proceed with one at a time.
(iii) Update B by deleting the row and column corresponding to LRi,j.
(iv) Repeat steps (ii)-(iii) until the procedure stops.
To illustrate Algorithm 1, consider our running example from Fig. 1, for which the pairwise LR matrix was given in (2). If T > 5, no identifications are made. For any T ≤ 5, the above algorithm identifies V2 = M2 and V3 = M3 (both with LR = 5), after which the procedure stops. Hence the reported solution is (*, M2, M3). As remarked earlier, M1 cannot be identified with this approach.
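Algorithm 1 can be sketched compactly. In the Python illustration below, only the values LR1,1 = LR2,1 = 1 and LR2,2 = LR3,3 = 5 are taken from the text; the remaining entry is a placeholder, and ties are broken by dictionary order instead of branching as the algorithm prescribes.

```python
def algorithm1(B, T):
    """Greedy pairwise matching (a sketch of Algorithm 1): repeatedly pick
    the largest remaining LR >= T and delete its row and column. B is a
    dict mapping (victim, MP) pairs to pairwise LR values; sex-inconsistent
    (zero) pairings are simply omitted."""
    ids = {}
    B = dict(B)
    while B:
        (v, m), lr = max(B.items(), key=lambda kv: kv[1])
        if lr < T:
            break
        ids[v] = m
        B = {vm: x for vm, x in B.items() if vm[0] != v and vm[1] != m}
    return ids

# Toy matrix; LR('V1','M2') is a placeholder value, the rest follow the text.
B = {('V1', 'M1'): 1, ('V1', 'M2'): 1, ('V2', 'M1'): 1,
     ('V2', 'M2'): 5, ('V3', 'M3'): 5}
print(algorithm1(B, T=2))   # {'V2': 'M2', 'V3': 'M3'}, i.e. (*, M2, M3)
```

As in the text, M1 is never identified: once V2 = M2 and V3 = M3 are removed, the remaining entry LR1,1 = 1 falls below any sensible threshold.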
The next algorithm is a refinement of Algorithm 1, with the crucial difference that the LR matrix is now recomputed in each step.
Algorithm 2.
Output: A proposed solution to the DVI problem in the form of an assignment a. (In case of ties, more than one assignment may result.)
Procedure: As Algorithm 1, but where step (iii) is replaced with the following:
(iii) Update B by deleting the row and column corresponding to LRi,j, and recomputing the remaining LR values conditional on all previous identifications.
When this strategy is applied to the example in Fig. 1, the first step gives a tie between V2 = M2 and V3 = M3 (both with LR = 5), so the procedure branches. In both cases, the identified solution is (M1, M2, M3).
The joint approach. We now consider the possibility (and feasibility) of joint identification of the victims.
Among the list A of all a priori possible assignments, we seek one that maximises the overall likelihood: an assignment a* is an optimal solution if L(a*) ≥ L(a) for all a ∈ A. In smaller cases, where |A| [as given by formula (4)] is manageable, this may be solved by brute force, i.e., by calculating the likelihood of each assignment and sorting them in descending order.
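The brute force search itself is trivial once the likelihoods are available; the hard part, computing L(a) from the pedigrees and marker data, is what dvir delegates to pedigree machinery. A Python sketch, with hypothetical log-likelihood values standing in for real computations (only log(L0) = −21.64 is taken from the example):

```python
from math import log

def joint_search(assignments, loglik):
    """Brute-force joint approach: rank all assignments by joint
    log-likelihood; the first element is an optimal solution a*."""
    return sorted(assignments, key=loglik, reverse=True)

# Hypothetical values for three of the 14 assignments of the toy example;
# real values come from pedigree likelihood calculations on all the data.
table = {
    ('M1', 'M2', 'M3'): -16.8,        # assumed best joint solution
    ('*',  'M2', 'M3'): -18.4,
    ('*',  '*',  '*' ): log(4e-10),   # the null model, L0 = 4 * 0.1**10
}
ranking = joint_search(table, table.get)
print(ranking[0])   # ('M1', 'M2', 'M3')
```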
Applying this to our running example in Fig. 1, Table 1 lists the likelihoods of all 14 possibilities. It shows that the assignment (M1, M2, M3) is a clear winner, five times more likely than the runner-up. In this case all calculations may be done manually. For example, under the null model (*, *, *) all five genotyped individuals are unrelated, giving the likelihood L0 = 4 · (0.1)^10 and (natural) log-likelihood log(L0) = −21.64, as shown in line 10 of Table 1.
Combined approach. In larger cases the number of possible assignments may be prohibitive for brute force calculations. For such cases we propose the combined approach described below. In brief, the idea is to first use a modified version of Algorithm 2 to settle the undisputed pairings, and then use brute force on the remaining problem. A pairing (Vi, Mj) is said to be undisputed if its pairwise LR reaches the given threshold T, while all other LR values involving Vi or Mj are small.
Algorithm 3.
Output: A list of assignments, ranked by likelihood.
Procedure:
Step 1 (Sequential).
(i) Compute the pairwise LR matrix B.
(ii) Identify all undisputed pairings Vi = Mj, characterised by LRi,j ≥ T while all other entries in the same row and column are ≤ 1. If no such elements are found, the procedure stops. (iii) Update B by deleting rows and columns corresponding to undisputed pairings, and recompute the remaining LRs conditional on these identifications. (iv) Repeat steps (ii)-(iii) until the procedure stops.
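Step 1(ii) amounts to a simple scan of the LR matrix. A Python sketch of the undisputed-pairing test, using toy numbers not taken from any of the case studies:

```python
def undisputed(B, T):
    """Find undisputed pairings in a pairwise LR matrix B (a dict mapping
    (victim, MP) pairs to LR values): LR(i, j) >= T while every other
    entry in row i and column j is <= 1."""
    out = []
    for (v, m), lr in B.items():
        if lr >= T and all(x <= 1 for (v2, m2), x in B.items()
                           if (v2 == v or m2 == m) and (v2, m2) != (v, m)):
            out.append((v, m))
    return out

# Toy numbers: V1 = M1 is undisputed; V2 = M2 has LR = 3, disputed and weak.
B = {('V1', 'M1'): 2e5, ('V1', 'M2'): 0.4,
     ('V2', 'M1'): 0.8, ('V2', 'M2'): 3.0}
print(undisputed(B, T=1e4))   # [('V1', 'M1')]
```

The Discussion considers relaxing the "≤ 1" limit; in this sketch that would simply replace the constant 1 by lr/T in the inner test.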
Step 2 (Joint).
(i) Create a list A of sex-consistent assignments involving the remaining individuals.
(ii) Reduce A by removing assignments containing impossible pairings, i.e., pairings Vi = Mj with LRi,j = 0.
(iii) Compute the likelihood L(a) of each remaining a ∈ A, and rank the assignments by likelihood.

Posterior pairing probabilities. If Ai,j denotes the set of assignments that include the pairing Vi = Mj, the posterior pairing probability of Vi = Mj under a prior π(a) is

qi,j = Σ_{a ∈ Ai,j} π(a) LRa / Σ_{a ∈ A} π(a) LRa = Σ_{a ∈ Ai,j} LRa / Σ_{a ∈ A} LRa,    (7)

where the latter equality assumes a flat prior. Here, LRa denotes the likelihood ratio comparing a to the empty (null) assignment.
The posterior non-pairing probabilities are computed similarly: if Ai,* denotes the set of assignments with no match for Vi, we have

qi,* = Σ_{a ∈ Ai,*} π(a) LRa / Σ_{a ∈ A} π(a) LRa = Σ_{a ∈ Ai,*} LRa / Σ_{a ∈ A} LRa,    (8)

where the latter equality assumes a flat prior.
For our running example in Fig. 1, the posterior probabilities with a flat prior are given in Table 3. Note that these probabilities are directly calculable from the LR column of Table 1, which provides the likelihood ratios required by formulas (7) and (8). For example, the top left entry is obtained by summing the LR values of the assignments that include V1 = M1, and dividing by the sum over all 14 assignments.

It is reasonable to conclude that Vi = Mj if qi,j > α for some α close to 1, say α = 0.99. A less stringent threshold α = 0.5 could be used if the objective is only to find the most likely match. Importantly, as long as α > 0.5, any pairings obtained in this way are consistent, in the sense that two victims cannot be paired with the same missing person. To show this, let Vi and Vi′ denote two different victims. Then for any j the sets Ai,j and Ai′,j are disjoint, so that

qi,j + qi′,j ≤ 1.

This implies that qi,j and qi′,j cannot both exceed 0.5.
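Given the ranked assignment list with its LR column, formulas (7) and (8) reduce to sums over subsets. A Python sketch with hypothetical LR values (only the null assignment's LRa = 1 is fixed by definition):

```python
def posterior_pairings(A_with_LR, victims, mps):
    """Flat-prior posterior pairing probabilities, formula (7): q[v, m] is
    the sum of LRa over assignments pairing v with m, divided by the sum
    over all assignments. m = '*' gives the non-pairing probability (8)."""
    total = sum(lr for _, lr in A_with_LR)
    return {(v, m): sum(lr for a, lr in A_with_LR if a[i] == m) / total
            for i, v in enumerate(victims) for m in mps + ['*']}

# Hypothetical LR column for three assignments of a toy problem:
A = [(('M1', 'M2', 'M3'), 125.0),
     (('*',  'M2', 'M3'),  25.0),
     (('*',  '*',  '*' ),   1.0)]   # null assignment: LRa = 1 by definition
q = posterior_pairings(A, ['V1', 'V2', 'V3'], ['M1', 'M2', 'M3'])
print(round(q[('V1', 'M1')], 3), round(q[('V1', '*')], 3))   # 0.828 0.172
```

The consistency property shown above corresponds to q[(v, m)] + q[(v2, m)] ≤ 1 for any two distinct victims v and v2.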

Results
A comparison of methods. The purpose of this example is to illustrate that the joint approach may succeed in cases where the sequential methods fail. The data are shown in Fig. 2, where the genotypes correspond to a marker with alleles 1, 2 and 3, with frequencies 0.05, 0.05 and 0.9, respectively. (The precise values of these frequencies are not important.) Given this information, the pairwise LR matrix can be computed. As a first observation we note that the column of 1's corresponding to M1 implies that M1 cannot be identified by any method which uses data from only one victim at a time, such as Algorithm 1.
Next we consider the more refined Algorithm 2, which updates B after each new pairing. Since LR1,2 = 20 is the highest entry, the procedure starts by identifying V1 = M2. But this means that M2 has genotype 1/1, which effectively blocks V2 (who is 2/2) from being identified as M1 or M3. Tracing the updated LR matrices to the end, we conclude that Algorithm 2 produces two equally likely assignments, (M2, *, M1) and (M2, *, M3).
By contrast, Table 4 shows that the optimal solution, when all the data are considered jointly, is the assignment (M3, M1, M2). In fact, this is 2000/200 = 10 times more likely than either of the solutions found by the sequential method above. Table 5 lists the posterior pairing probabilities under a flat prior.
In order to investigate the practical relevance of joint identification, we conducted a series of simulation experiments based on a standard set of forensic markers. After all, the genotypes in Fig. 2 were specifically chosen to illustrate the effect, and with multiple markers one might expect such anomalies to be drowned out. Unfortunately, this is not the case. Figure 3 compares how the true positive rates (TPR) of Algorithms 1-3 vary with the number of markers, depending on the true solution as indicated in the title of each panel. In each case, 500 sets of DNA profiles were simulated for the set of 35 autosomal markers comprising the database NorwegianFrequencies available and documented in the R package forrel. The simulations were performed with the forrel function profileSim. All subsequent identifications used the LR threshold T = 10,000 (for Algorithm 3, the threshold applied to the highest joint LR compared with the null). In addition to the three algorithms previously described, we included the TPR of the most likely solution reported by Algorithm 3, whether or not its LR exceeded T. Figure 3 reveals that for this particular DVI problem, the joint method (Algorithm 3) has a TPR near 1 already with 5 markers, while the best sequential method (Algorithm 2) needs 20 markers to reach the same level. Overall, Algorithm 3 clearly outperforms the others in all cases shown in the top row of Fig. 3. Moreover, it is the only method to reliably reach a conclusion when the true assignment is (M1, *, M3).
Case study 1: plane crash. In this and the next section we demonstrate the adequacy of joint identification in real-life scenarios, by analysing two realistic DVI datasets. The first case is based on a simulated plane crash, and features multiple reference families with a single missing individual. The second case involves a large pedigree with many missing members. In accordance with Algorithm 3, we start by computing the pairwise LR values, with the results given in Table 6. We observe that the identifications V2 = M3, V4 = M5 and V6 = M2 are undisputed (in the sense defined in Algorithm 3) when applying the threshold T = 10,000. In addition, V1 = M1 also has a high LR, but does not reach the threshold. Based on these observations we anticipate the solution (M1, M3, *, M5, *, M2, *, *). Indeed, this assignment is the optimal solution found in the joint analysis, presented in Table 7. It is noteworthy that the many impossible and undisputed pairings in the pairwise LR matrix (Table 6) leave only two assignments for consideration in the joint step. Hence the computational cost of this step is negligible.
Next we introduce a mutation model, motivated by the possibility that a mutation may explain the lack of identification for M4. In fact, an examination of the data reveals that V3 and R4 share alleles at all but one marker, suggesting that these may have a parent-child relationship. We use a proportional model with mutation rate 0.001 16. The choice of model is not significant here, but we note that the property of stationarity enjoyed by proportional models facilitates validation against other implementations. Under this model, the pairwise LR for V3 = M4 changes from 0 to 249. A joint analysis produces the top list shown in Table 8. We see that the identification V3 = M4 is now convincingly included in the most likely assignment. This observation is reinforced by the posterior pairing probabilities given in Table 9, calculated with a flat prior of π(a) = 1/19081.

Table 6. Pairwise LRs for the plane crash example. Only nonzero elements are shown; entries reaching the threshold T = 10,000 are highlighted.
Table 7. Results of joint analysis of the plane crash example, without mutation modelling.
Table 8. The most likely assignments in the plane crash example, when mutations are modeled.

Case study 2: a large reference family. Our second main example involves the pedigree in Fig. 5, featuring twelve missing individuals (M1, ..., M12) and six typed references (R1, ..., R6). Five victim samples (V1, ..., V5) are to be matched against these. The case is based on an example from a workshop organized by the International Society for Forensic Genetics (ISFG) 17, but we have adapted the pedigree slightly for notational consistency and simulated new marker data. The simulations were performed using 13 CODIS markers. In this case, the pairwise LR matrix (Table 10) shows that there are no undisputed pairings (with T = 10,000).
Admittedly, the pairing V4 = M8 has a high LR at 2.85 · 10^6, but it is disputed (in the sense of Step 1(ii) of Algorithm 3) by LR4,10 and LR4,11, which both exceed 1. (Relaxations of this step are considered in the Discussion.) Nevertheless, the many zeroes in Table 10 lead to a substantial reduction in the space of assignments, manifested in Step 2(ii) of Algorithm 3. More precisely, the a priori 9847 assignments given by formula (4) (with sF = 3, sM = 2, mF = mM = 6) are reduced to 1898 after removal of impossible pairings. Joint analysis of these assignments took about 15 seconds on a standard laptop, and resulted in the top list presented in Table 11.
Note that two assignments tie for the best solution, differing in their identification of victim V2. This ambiguity is also apparent in Table 12.

Discussion
The main contribution of this paper is to show that joint identification generally outperforms sequential DVI methods, and that careful implementation makes the joint approach computationally feasible even in fairly large cases.
From a computational point of view, DVI applications can be viewed as a series of kinship tests. For general issues concerning kinship testing, like the assumptions of Hardy-Weinberg equilibrium and independence of markers, we therefore refer to the rich literature on this subject 18. Problems related to poor quality of DNA are also widely discussed in the forensic literature 19. In the following we restrict the discussion to aspects that are particular to the methods of this paper.
The main arguments favouring the joint approach are matters of principle. From a statistical point of view, a joint dataset should, by default, be analysed jointly if practically possible. In particular, this avoids the inconsistent solutions that may emerge when splitting a joint problem into separate subproblems. However, there are also other important benefits, connected with multiple testing and statistical power. The previously published sequential approaches typically involve many separate tests using different parts of the data. This makes power calculations difficult, if not impossible. In principle one could control for the number of tests performed by increasing the LR threshold, resembling classical post-hoc adjustments in multiple significance testing, but this approach is rarely practical in DVI contexts. The power of our joint approach is not affected by multiple testing issues in the same way, since the problem is not split into many separate tests.
The limiting factor for the utility of joint DVI is the computational burden. Table 2 clearly illustrates the rapid growth of the a priori number of assignments, and it is not difficult to construct problems beyond the current reach. On the other hand, surprisingly large problems may still be manageable if a sufficient number of undisputed matches are found in the first step.
The algorithms we have presented can be modified or tuned in various ways. As in many similar applications, the most important parameter is arguably the LR threshold T. For a general discussion we refer to the established literature 20 , also in connection with familial searching 21 . In the context of DVI we expect the value of T to depend both on the particular protocol and external factors. Simulation experiments like the one summarised in Fig. 3 may provide guidance when deciding the threshold.
We mention one potential modification, which may have a significant impact on the run-time of Algorithm 3. Recall that Step 1(ii) of this algorithm used the pairwise LR matrix to identify undisputed pairings V i = M j , characterised by LR i,j ≥ T while all other entries in the same row and column are ≤ 1.
The last part of this criterion may be relaxed, for instance by increasing the final limit 1 to LRi,j/T. The effect of this change is easily seen in Case study 2, specifically Table 10, where the pairing V4 = M8 would now be classified as undisputed.
For simplicity we used autosomal markers in our examples, but there are no methodological obstructions to including mtDNA, X or Y markers in joint DVI computations. In fact, our implementation in the dvir package already supports X-chromosomal markers. As previous authors have noted, it is not obvious how evidence from different types of markers should be reported, and opinions differ 22 .
A well-known challenge in forensic genetics is that LR calculations are sensitive to misspecified allele frequencies. In some DVI cases it may therefore be difficult to decide on an appropriate frequency database, particularly if the individuals originate from different populations. This problem has previously been addressed in the context of familial searching 23 . A practical approach is to do ad hoc sensitivity calculations. If the overall conclusions remain unchanged with different databases, this strengthens the confidence in the results. As a general comment we note that most autosomal forensic markers have been specifically selected for their relatively stable allele frequencies across populations, while this to a lesser extent holds for mtDNA, X or Y markers.
The question of how the statistical evidence should be reported in identification cases is difficult and lacks general consensus. Although there is a tradition of specifying priors and reporting posterior probabilities in addition to likelihood ratios 10, our view is that specifying priors should be left to the decision makers. This is supported by ISFG recommendations 6, whose point 11 includes: "In DVI work, DNA statistics are best represented as likelihood ratios that permit DNA results to be combined among multiple genetic systems or with other non-DNA evidence."
Nevertheless, we have included in several tables the posteriors with a flat prior, for reference. We suggest reporting the identification defined by the optimal assignment if its LR compared to the closest contender exceeds a threshold, say 10,000. Clearly, this ensures that the LR against the null also exceeds the same threshold.

Conclusion
This paper presents and discusses methods for DNA-based disaster victim identification. Restricted approaches, in which the victims are considered separately or sequentially, may give inconsistent or ambiguous results. We therefore generally recommend the combined approach summarised by Algorithm 3. The idea is simple: first settle the virtually obvious pairings, and then do a complete search to resolve the remaining ones. The resulting joint solution should be supplemented by posterior pairing and non-pairing probabilities, which summarise the evidence for each individual identification. All methods described in this paper are implemented in the R package dvir, which is freely available from the official R repository (CRAN) and runs on all platforms. The documentation of the package provides further details and instructive examples.