Design of training populations for selective phenotyping in genomic prediction

Phenotyping is the current bottleneck in plant breeding, especially because next-generation sequencing has decreased genotyping cost more than 100,000-fold in the last 20 years. Therefore, the cost of phenotyping needs to be optimized within a breeding program. When implementing a genomic selection scheme in the breeding cycle, breeders need to choose the optimal method for (1) selecting training populations that maximize genomic prediction accuracy and (2) reducing the cost of phenotyping while improving precision. In this article, we compared methods for selecting training populations under two scenarios: first, when the objective is to select a training population set (TRS) to predict the remaining individuals from the same population (Untargeted), and second, when a test set (TS) is first defined and genotyped and the TRS is then optimized specifically around the TS (Targeted). Our results show that optimization methods that include information from the test set (Targeted) achieved the highest accuracies, indicating that a priori information from the TS improves genomic predictions. In addition, predictive ability improved most when the population size was small, which is the setting breeding programs aim for in order to reduce phenotyping cost.
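To make the distinction between the Untargeted and Targeted scenarios concrete, the following is a minimal sketch and not the procedure used in this study. It assumes genotypes are summarized by principal component scores, uses a ridge-regression form of the mean prediction error variance (PEVmean, up to the residual variance) as the criterion, and swaps in a simple greedy forward search in place of the optimization algorithm of the STPGA package. The function names `pev_mean` and `greedy_select`, the shrinkage value `lam`, and the simulated data are all hypothetical.

```python
import numpy as np

def pev_mean(train_pcs, target_pcs, lam=1e-2):
    """Mean prediction error variance of the target rows under ridge
    regression on PC scores, up to the residual variance (assumed form)."""
    X = np.column_stack([np.ones(len(train_pcs)), train_pcs])
    T = np.column_stack([np.ones(len(target_pcs)), target_pcs])
    info = X.T @ X + lam * np.eye(X.shape[1])
    # diag(T info^{-1} T') computed row by row, without the full matrix
    solved = np.linalg.solve(info, T.T)
    return float(np.mean(np.sum(T * solved.T, axis=1)))

def greedy_select(cand_pcs, n_train, target_pcs=None, lam=1e-2):
    """Greedy forward selection of n_train candidates minimizing PEVmean.

    target_pcs=None -> untargeted: the criterion is averaged over the
                       candidates that are not in the trial training set.
    target_pcs given -> targeted: the criterion is averaged over the TS.
    """
    if not 0 < n_train < len(cand_pcs):
        raise ValueError("n_train must be between 1 and len(cand_pcs) - 1")
    selected, remaining = [], list(range(len(cand_pcs)))
    while len(selected) < n_train:
        best_idx, best_val = None, np.inf
        for i in remaining:
            if target_pcs is None:
                target = cand_pcs[[j for j in remaining if j != i]]
            else:
                target = target_pcs
            val = pev_mean(cand_pcs[selected + [i]], target, lam)
            if val < best_val:
                best_idx, best_val = i, val
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# Simulated PC scores (hypothetical data): 60 candidates and a TS of 15
# individuals drawn from a shifted region of PC space.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(60, 3))
test_set = rng.normal(loc=1.5, size=(15, 3))

untargeted = greedy_select(candidates, 10)
targeted = greedy_select(candidates, 10, target_pcs=test_set)

# Evaluate both training sets on the same TS with the same criterion.
pev_u = pev_mean(candidates[untargeted], test_set)
pev_t = pev_mean(candidates[targeted], test_set)
print(f"PEVmean on TS, untargeted TRS: {pev_u:.4f}")
print(f"PEVmean on TS, targeted TRS:   {pev_t:.4f}")
```

The only difference between the two calls is which individuals the criterion is averaged over, which mirrors the U-Opt and T-Opt distinction in the text. A greedy search is used here only because it is short and deterministic; it is not guaranteed to find the same sets as a stochastic search would.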

1. Figure S1. Genotypes selected by the optimization algorithm, plotted on principal components 1 and 2 for dataset 1. The genotypes were selected based on AOPT, CDmean, DOPT, PEVmean, CDMEANMM, and PEVMEANMM from the STPGA package.

3. Figure S3. Prediction accuracies for the stripe rust severity time 1 trait using sampling algorithms within the STPGA package on dataset 1. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 300, 600 and 1000) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

4. Figure S4. Prediction accuracies for the stripe rust severity percentage trait using sampling algorithms within the STPGA package on dataset 1. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 300, 600 and 1000) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

5. Figure S5. Prediction accuracies for the adult stripe rust reaction time 1 trait using sampling algorithms within the STPGA package on dataset 1. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 300, 600 and 1000) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

6. Figure S6. Prediction accuracies for the adult stripe rust reaction time percentage trait using sampling algorithms within the STPGA package on dataset 1. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 300, 600 and 1000) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

7. Figure S7. Prediction accuracies for heading date using sampling algorithms within the STPGA package on dataset 1. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

8. Figure S8. Prediction accuracies for height using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

9. Figure S9. Prediction accuracies for lodging time 1 using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

10. Figure S10. Prediction accuracies for lodging time 2 using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

11. Figure S11. Prediction accuracies for protein using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

12. Figure S12. Prediction accuracies for test weight using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

13. Figure S13. Prediction accuracies for test weight using sampling algorithms within the STPGA package on dataset 2. Accuracies of the predictions of the test set (TS) genotypes were calculated using 4 different algorithms and 2 methods and compared with random sampling. In the U-Opt method, the TS was not used to build the training population set (TRS), while in the T-Opt method the optimization algorithm used the TS to build the TRS. The TRS was defined by optimizing A-Opt, D-Opt, CDmean, and PEVmean. Four different population sizes (100, 200, 300 and 400) were used for the optimization algorithm. Standard error is indicated for each point over 30 (U-Opt and T-Opt) and 100 (random) runs.

14. Figure S14. Degree of accuracy sacrificed by the approximations based on principal components. CDMEAN and PEVMEAN based on mixed models are compared with their ridge-regression-based approximations using 10, 50 and 100 principal components for dataset 1 and the trait height.

15. Figure S15. Degree of accuracy sacrificed by the approximations based on principal components. CDMEAN and PEVMEAN based on mixed models are compared with their ridge-regression-based approximations using 10, 50 and 100 principal components for dataset 2 and the trait yield.

16. Figure S16. Effect of increasing target size for dataset 1, with the training population size fixed at 100 individuals. A: AOPT, D: DOPT, C: CDMEAN, CT: CDMEAN Targeted, P: PEVMEAN, PT: PEVMEAN Targeted, R: Random. In general, increasing the target population size decreases the advantage of an optimized targeted training population over an optimized untargeted training population. However, the optimized training populations retain their advantage over the random training populations.

17. Figure S17. Effect of increasing target size for dataset 1, with the training population size fixed at 300 individuals. A: AOPT, D: DOPT, C: CDMEAN, CT: CDMEAN Targeted, P: PEVMEAN, PT: PEVMEAN Targeted, R: Random. In general, increasing the target population size decreases the advantage of an optimized targeted training population over an optimized untargeted training population. However, the optimized training populations retain their advantage over the random training populations.

21. Figure S21. Dataset 1. Predicting a test population other than the one that is targeted. Target size 100. A: AOPT, D: DOPT, C: CDMEAN, CT: CDMEAN Targeted, P: PEVMEAN, PT: PEVMEAN Targeted, R: Random. The optimized targeted and the optimized untargeted training populations perform similarly for predicting a population other than the one that is targeted.
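The captions above report a mean prediction accuracy and its standard error over 30 optimized runs and 100 random runs. The sketch below shows one way such a summary could be computed. It assumes that accuracy is measured as the Pearson correlation between observed and predicted values in the TS, which the captions do not state explicitly; the function names `accuracy` and `mean_and_se` and the simulated data are hypothetical.

```python
import numpy as np

def accuracy(observed, predicted):
    """Pearson correlation between observed and predicted values
    (assumed definition of prediction accuracy)."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def mean_and_se(accuracies):
    """Mean and standard error of the mean across replicate runs."""
    a = np.asarray(accuracies, dtype=float)
    if a.size < 2:
        raise ValueError("at least two runs are needed for a standard error")
    # Sample standard deviation (ddof=1) divided by sqrt(number of runs)
    return float(a.mean()), float(a.std(ddof=1) / np.sqrt(a.size))

# Simulated example (hypothetical data): 30 replicate runs, each scoring
# predictions for a test set of 200 genotypes.
rng = np.random.default_rng(1)
run_accuracies = []
for _ in range(30):
    observed = rng.normal(size=200)
    predicted = 0.6 * observed + rng.normal(scale=0.8, size=200)
    run_accuracies.append(accuracy(observed, predicted))

mean_acc, se_acc = mean_and_se(run_accuracies)
print(f"mean accuracy = {mean_acc:.3f}, standard error = {se_acc:.3f}")
```

Because the standard error shrinks with the square root of the number of runs, the 100 random runs and the 30 optimized runs give error bars that are not directly comparable in width.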