A look-ahead Monte Carlo simulation method for improving parental selection in trait introgression

Multiple trait introgression is the process by which multiple desirable traits are converted from a donor to a recipient cultivar through backcrossing and selfing. The goal of this procedure is to recover all the attributes of the recipient cultivar, with the addition of the specified desirable traits. A crucial step in this process is the selection of parents to form new crosses. In this study, we propose a new selection approach that estimates the genetic distribution of the progeny of backcrosses after multiple generations using information of recombination events. Our objective is to select the most promising individuals for further backcrossing or selfing. To demonstrate the effectiveness of the proposed method, a case study has been conducted using maize data where our method is compared with state-of-the-art approaches. Simulation results suggest that the proposed method, look-ahead Monte Carlo, achieves higher probability of success than existing approaches. Our proposed selection method can assist breeders to efficiently design trait introgression projects.

Although there does not currently exist much literature to integrate operations research techniques and trait introgression, there are still a few impactful studies. Cameron et al. utilized an operations research framework with a stochastic optimization model to identify the best breeding strategies for a given population under resource constraints 28 . This work illustrates the potential optimization modeling can have on resource allocation in plant breeding. Probabilistic simulation techniques have also been performed by Sun and Mumm 29 to assess in silico various crossing schemes and breeding approaches. Moreover, Han et al. 30 has framed trait introgression as an algorithmic process and introduced a novel selection metric, predicted cross value (PCV), which predicts specific combining ability by estimating the probability that a pair of parents will produce a perfect gamete with all desirable alleles.
Due to the importance of optimizing the breeding pipeline and the need to consider resource limitations for large scale breeding programs, this paper aims to design a platform that integrates operations research methods to trait introgression. Specifically, the authors develop a novel Monte Carlo simulation approach for the TI pipeline to consider the parental selection aspect under different scenarios of resources present within a commercial scale TI program. The originality in the proposed method, look-ahead Monte Carlo (LMC), is to look ahead and estimate the performance of progeny in the target generation and then optimize the selection decisions based on the estimated performance. In this study, we use computer simulations to compare selection strategies with respect to the recurrent parent background gene recovery percentage of individuals in the final generation.

Methods
In this section, we first define the problem by describing the backcrossing breeding pipeline and introduce two existing selection methods. Then, we propose the novel look-ahead Monte Carlo selection method.
Problem definition. The following abbreviations will be used in subsequent sections in this paper: DR: donor RP: recurrent parent BCt: backcross population at generation t BCTF2: self-fertilized population after final backcross GEBV: genomic estimated breeding value PCV: predicted cross value LMC: look-ahead Monte Carlo The general objective of trait introgression projects is to produce a new line that is highly close to the recurrent parent, and contains the desired alleles or traits from the donor parent. First an initial cross is made between the donor and recurrent parent to produce F1 progeny. Since, the donor and recurrent parents are both homozygous, this step is deterministic which means the F1 progeny has 50% of the genetic material from each parent. Next, the F1 individual is crossed to the recurrent parent to develop a backcross one (BC1) population. Figure 1 represents a schematic overview of the backcross project where the ultimate goal is to produce drought resistant individual plants with good agronomics. In Fig. 1, we see n individuals in the backcross one population denoted with BC1 1 , BC1 2 , ..., and BC1 n . Best individuals from the BC1 population were selected based on a selection strategy and then again backcrossed to the recurrent parent.
In successive generations, progeny are first selected for the trait of interest and then backcrossed to the recurrent parent. This process is repeated for T backcross generations. We refer to an individual as positive if it contains the desirable alleles from the donor. For positive individuals in BCT population, the percentage of beckground recovery was calculated by dividing the number of desirable alleles in the background by the total number of background alleles. Furthermore, we monitor and evaluate the BCTF2 individuals.
The donor parent (poor agronomics, drought resistant) is crossed with the recurrent parent (good agronomics, drought sensitive) to produce F1. F1 progeny have 50% of their genetic material from each parent (yellow square: favorable allele, purple square: unfavorable allele). Then, F1 is backcrossed with the recurrent parent to develop the BC1 population. Best individuals from BC1 are selected based on a predefined metric and backcrossed to the recurrent parent. This process is repeated for T generations. The ultimate goal is to achieve an individual which is drought resistant and has good agronomics.
To simulate the recombination process during meiosis we used the same inheritance distribution defined in 30 . In subsequent sections, the recombination frequency vector is denoted by r ∈ [0, 0.5] L−1 , where L is the total number of markers in the genome. To represent the genotype of an individual plant, we use an L × m binary matrix, say G ∈ B L×m , where G l,m = 1 indicates whether the l th allele from chromosome m is desirable or not ( G l,m = 0 ). For each individual plant represented with a binary matrix, each row is a locus in the genome. The number of columns in the binary matrix represents the ploidy of the plant. We use diploid species in this paper ( m = 2 ). Here, we review two existing approaches for parental selection.
Background selection. The background selection approach first selects the individuals with desired marker genotypes and then among these positive individuals selects for the desired background genotype 3,7,31 . Background selection has been shown to be efficient by previous theoretical work 3,31-33 and experimental work 2 .
The breeding value of a background genotype can be estimated using genomic estimated breeding value (GEBV) 34,35 . GEBV of individual plants (or animals) is defined as the sum of their estimated marker effects 34 . We assume D = {d 1 , d 2 , ..., d z } is the location of the positive markers from the donor and there are total Z markers that should be introgressed. If we assume uniform weight for all desirable alleles, then the background GEBV of an individual is equivalent to the number of desirable alleles in its background: www.nature.com/scientificreports/ According to this approach, the positive individuals with highest GEBVs will be selected as parents.
Predicted cross value selection. The predicted cross value (PCV) calculates the probability that a pair of breeding parents will produce a gamete with desirable alleles at all specified loci by taking into account the recombination frequencies 30 . This approach selects individuals based on their likelihood to produce an elite gamete by combining all desirable alleles. Since in a backcrossing scheme, individuals are always crossed with the recurrent parent, the PCV can be defined as the probability that each individual will produce an elite gamete. Let g ∈ B L×1 denote a random gamete produced by a breeding parent. The PCV of an individual is calculated as follows: www.nature.com/scientificreports/ To calculate this probability, the same water-pipe algorithm described in 25 is used. The rationale for the PCV definition is to calculate the probability that none of the undesirable alleles survives two generations of meiosis 30 . According to this approach, the positive individuals with highest PCVs will be selected as parents.
Proposed look-ahead Monte Carlo algorithm. In this section we propose a novel probabilistic and heuristic driven search algorithm, look-ahead Monte Carlo (LMC) for parental selection. The underlying concept is to use Monte Carlo simulation for modeling uncertainty involved due to recombination events. Monte Carlo simulation is a technique that relies on repeated random sampling to obtain numerical results 36 . This technique is often used in physical and mathematical problems and is most suited to be applied when it is impossible to obtain a closed-form expression or infeasible to apply a deterministic algorithm 37 .
The look-ahead Monte Carlo algorithm for parental selection evaluates different selection decisions periodically during the learning phase by predicting the genetic distribution of the progeny of backcrosses after multiple generations using information of recombination events. This algorithm makes a trade-off between exploration and exploitation. It exploits the selection strategies that is found to be best until the current generation and also explores the alternative decisions to find out if they could replace the current best. The essence of this algorithm is to strategically search the space to find optimal crosses that can result in best performance in the targeted generation. Figure 2 presents an overview of the LMC algorithm. For every individual in BCt population (e.g., BCt i ), multiple random gametes are simulated according to the recombination frequencies. These gametes are narrowed down to the ones which have the desirable markers from the donor in the introgressed loci. Then, one of these positive gametes is selected randomly to form the next BC progeny (e.g., BC(t+1) i ). This process is repeated until the target generation (BCT). Finally, individuals are evaluated based on their performance after selfing (BCTF2).
In BCTF2, success can be defined as achieving certain amount of recovery percentage (e.g., 95% ) among positive individuals. Suppose the population size of the BCTF2 generation is K and n individuals with desirable markers have achieved the desired recovery percentage. Then n K is the probability of getting a positive individual that has met the background recovery requirements. Since through backcross generations the gametes are selected randomly, this probability is estimating only one of the possible outcomes for individual i in BCt population. To have a reasonable approximation for the performance of progeny in BCTF2, the same process should be repeated multiple times. The objective of the LMC algorithm can then be calculated as: where v j represents the maximum recovery percentage achieved in BCTF2 for the jth round, and P is total rounds of repetition. According to LMC approach, individuals with highest Q values will be selected as the breeding parents.

Results
In this section, we first describe the data sets used in this case study, and then compare the proposed method with two existing selection methods in different scenarios of resources using computer simulation.
Data. Data contains donor and recipient's genetic information and recombination frequencies. To explore the effect of having different initial genetic similarities between donor and recurrent parent, we considered three cases as demonstrated in Table 1. The genetic similarity is calculated based on the NEIs metric 38 . Cases 1, 2, and 3 have low, moderate and high initial genetic similarities respectively. Our goal is to compare the performance of selection strategies using these 3 different cases given that the low initial genetic similarity (case 1) is expected to be more difficult relatively.
The genetic information of donor and recurrent parent for these three cases are illustrated in Fig. 3. For all cases, three markers should be integrated from the donor to the recurrent parent. Furthermore, the recombination events are presented in the supplementary information.
Simulation settings. Multiple trait introgression was studied using realistic maize data with three different selection methods, including GEBV, PCV, and LMC. We considered three cases with different genetic similarities and two different scenarios for resources. These scenarios are designed considering the practical aspects of a breeding program. In scenario 1, we are allocating limited resources by making 2 crosses in each generation whereas in scenario 2, we are allocating moderate resources by making 6 crosses in each generation (see Table 2). Scenario 1 more closely resembles what occurs in a commercial breeding program, namely, decision making with limited resources.
One hundred independent simulation replicates were performed for each of the selection methods using MATLAB (R2019-a). Simulation has been performed for three generations of backcrosses followed by a selfing. The evaluation is based on the recovery percentage of individuals in BC3F2 generation.
It is assumed that each cross makes 200 progeny and for each scenario the number of crosses remains the same through all generations (i.e. resources are distributed evenly among different generations). Figure 4 presents the performance of three selection methods for one sample simulation. The histograms of background recovery percentage (2) PCV (G, r) = Pr(g i = 1, ∀i ∈ {1, 2, ..., L})  paths. This is a schematic overview of monitoring the estimated performance in BC3F2 for a single path. The same process is repeated multiple times and then the average value is assigned to BC1 i .  Figure 3. Donors and recurrent parents' genetic information and recombination frequencies for three cases (DR: donor, RP: recurrent parent, r: recombination frequency). A yellow square is used to denote a favorable allele ("1") and a purple square is used to denote an unfavorable allele ("0"). The gray charts are heat maps for recombination frequencies. It should be noted that the BC3F2 individuals should have all 6 alleles desirable in the three markers that are to be integrated from the donor (i.e. BC3F2 individuals are homozygous). However, the BC individuals are expected to have 3 desirable alleles total since their second chromosome is being inherited from the recurrent parent. This can explain why recovery percentage drops from BC3 to BC3F2. As demonstrated in Fig. 4, for this sample simulation, the LMC method achieves 95% recovery in BC3F2, however the other two selection methods achieve maximum 91% recovery.

Simulation results. Comparison of selection methods for one sample simulation
Background recovery percentage of the top individual in BC3 across all simulation replicates Figure 5 compares the cumulative distribution functions (CDFs) of maximum recovery percentage achieved in BC3 for three selection methods among 100 simulation replicates. The further toward the right direction a CDF curves, the better performance a method has. Take for example, point (97, 75) means that 75% of the simulations have achieved recovery percentage less than or equal to 97. In all cases and scenarios, the LMC method achieves higher recovery percentage. www.nature.com/scientificreports/ For case 3 which has the highest genetic similarity between donor and recurrent parent, there was one simulation that resulted in having one individual in BC3 with all desirable traits (100% recovery percentage). Note that since this is a backcross generation, for this individual the second chromosome still lacks the desirable alleles from the donor. As expected, for each case, scenario 2 has better performance compared to scenario 1 since there are more resources available.
Average background recovery percentage of top 10 individuals in BC3 across all simulation replicates Figure 6 presents the box-plots of average recovery percentage of top 10 individuals in BC3 generation. For all cases and scenarios, the median value is higher when selection decisions are optimized using LMC method. Furthermore, PCV has generally higher median values than GEBV. The overall range of values is greater for LMC method (as shown by the distances between the ends of the two whiskers for each box-plot). The interquartile ranges are reasonably similar (as shown by the lengths of the boxes), except for case 2, scenario 1, where LMC has considerably higher range.
Background recovery percentage of the top individual in BC3F2 across all simulation replicates Figure 7 compares the probability of success for three selection methods by evaluating recovery percentage of best individual in BC3F2. For example, point (0. 8 As expected, scenario 2 has generally higher probability of success compared to scenario 1 as more resources are used. Take for example, for case 2 the probability of achieving 95 percentage recovery with LMC method increases from 0.74 to 0.83 when having more resources. This probability also increases from 0.59 to 0.71 for PCV and from 0.54 to 0.69 for GEBV method. Furthermore, the probability of success increases from case 1 to 3 where there is more genetic similarity between the donor and parent. Take scenario 2 for example, the probabilities of having 95 percentage recovery when selection decisions are optimized using LMC method are 0.81, 0.83, 0.89 for cases 1, 2, and 3, respectively.

Discussion
Selection methods based on marker information make trait introgression more efficient and effective. When introgressing the desired traits form a donor to a recipient, background selection is the conventional selection approach that aims to recover the desired background genome. Recent advances in optimization and simulation techniques can help enhancing the efficiency of parental selection in breeding programs. www.nature.com/scientificreports/ In this study, we introduced a new selection method, LMC, which has the potential to further improve the efficiency of breeding given limited time and resources by integrating operations research techniques and trait introgression. The proposed method was compared with existing selection methods in a simulation study using empirical maize data. Computational results demonstrate the improvements of the LMC method over two existing selection approaches, GEBV and PCV.
One of the advantages of the LMC method is being sensitive to the deadline. Unlike other selection methods that evaluate the performance based on only next generation, the LMC method relates the objective to the performance of individuals in the targeted generation. Another advantage of the LMC method is the trade-off between exploration and exploitation. When the look ahead process finds exploitation to contribute more to the final objective, the algorithm behaves in a greedy way to maximize performance. However, when the exploration is found to be more beneficial, the algorithm explores new possible outcomes. www.nature.com/scientificreports/ The simulations in this study were designed based on practical considerations. The trait introgression pipeline included three backcross generations followed by a selfing so that selected individuals will be homozygous for the target trait. There is no absolute number for the number of backcrosses needed to be performed but generally between two to five backcrosses are performed in maize. The number of required generations can be determined based on the breeding objective and the resources invested at each generation 39 . Intuitively, making more crosses and producing more progeny leads to a higher chance of creating desirable individuals, however the resources are limited and the breeding strategy should be customized based on available resources. Here, we considered two scenarios to represent both limited and moderate cases of resource availability. Scenario 1 limits the number of crosses to two in each generation where as scenario 2 allows six crosses. According to reproductive biology of maize, it is possible to obtain ≈ 200 seeds from a cross. Thus, we assumed each cross makes 200 progeny which means for scenarios 1 and 2, the population size of each generation becomes 400 and 1200 respectively. As expected, simulation results demonstrated that the probability of success increases when having more resources.
Future work should investigate optimizing the resource allocation strategies by spreading out the budget systematically among different generations. Also, this study investigated introgressing desirable alleles from a single donor, however desirable alleles can be carried by multiple donors. Hence, another direction that deserves investigation is to extend the LMC method for the cases with multiple donors. Moreover, we based our simulations on a single crop organism. Further simulations considering more diverse populations are necessary to demonstrate the general applicability of the proposed selection method.