Assessment and modeling using machine learning of resistance to scald (Rhynchosporium commune) in two specific barley genetic resources subsets

Barley production worldwide is limited by several abiotic and biotic stresses and breeding of highly productive and adapted varieties is key to overcome these challenges. Leaf scald, caused by Rhynchosporium commune is a major disease of barley that requires the identification of novel sources of resistance. In this study two subsets of genebank accessions were used: one extracted from the Reference set developed within the Generation Challenge Program (GCP) with 191 accessions, and the other with 101 accessions selected using the filtering approach of the Focused Identification of Germplasm Strategy (FIGS). These subsets were evaluated for resistance to scald at the seedling stage under controlled conditions using two Moroccan isolates, and at the adult plant stage in Ethiopia and Morocco. The results showed that both GCP and FIGS subsets were able to identify sources of resistance to leaf scald at both plant growth stages. In addition, the test of independence and goodness of fit showed that FIGS filtering approach was able to capture higher percentages of resistant accessions compared to GCP subset at the seedling stage against two Moroccan scald isolates, and at the adult plant stage against four field populations of Morocco and Ethiopia, with the exception of Holetta nursery 2017. Furthermore, four machine learning models were tuned on training sets to predict scald reactions on the test sets based on diverse metrics (accuracy, specificity, and Kappa). All models efficiently identified resistant accessions with specificities higher than 0.88 but showed different performances between isolates at the seedling and to field populations at the adult plant stage. The findings of our study will help in fine-tuning FIGS approach using machine learning for the selection of best-bet subsets for resistance to scald disease from the large number of genebank accessions.

www.nature.com/scientificreports/ sampling of larger subsets 53,54 . In addition, new alleles for resistance to powdery mildew in wheat 55 were identified from a targeted FIGS subset. The present study aims at the identification of sources of resistance to scald in GCP and FIGS subsets, and to assess different predictive models for efficient mining of genetic resources for resistance to scald.

Results
Reaction of GCP and FIGS subsets to scald disease. Scald infections were successful under both controlled and field conditions as shown by the uniformity of the infection of the susceptible accessions as well as the checks. The averages scores of susceptible (Tocada and Tissa) and resistant (ICARDA4 and Atlas 46) checks at the adult plant stage were 8 and 1, respectively. While at the seedling stage, average scores of resistant and susceptible checks for isolate SC-511 were 0 and 5, and for the isolate SC-1122 were 1 and 4, respectively.
The two subsets (FIGS and GCP) included sources of resistance to scald at both the seedling and adult plant stages, but their percentages differed based on isolates at the seedling stage and based on location and pathogen populations under field conditions. At the seedling stage, GCP subset showed 6% and 16% resistant accessions, and 79% and 54% susceptible accessions to scald isolates SC-511 and SC-1122, respectively. For FIGS subset, 12% and 34% of the accessions were resistant, while 60% and 21% were susceptible to SC-511 and SC-1122 isolates, respectively (Fig. 1A,B). Different percentages for the moderately resistant and moderately susceptible classes were found among the two subsets for both scald isolates tested. FIGS subset had 8% more accessions with MR and MS reactions compared with GCP for the SC-1122 isolate. For isolate SC-511, 4% of GCP and 2% of FIGS accessions were MR while 12% and 26% showed MS reaction, respectively. Isolate SC-511 showed less percentage of accessions with immune, resistant, and moderately resistant reaction and a higher percentage of susceptible accessions compared to isolate SC-1122 ( Fig. 1).
At the adult plant stage, the percentage of accessions for scald disease severity differed based on the location, year, and subsets. The highest percentages of resistant accessions for both subsets (65% for GCP, and 82% for FIGS) were observed at Guich-Morocco 2018, and the lowest percentages at Holetta 2017 (25% for GCP, and 22% for FIGS) (Fig. 2). For Marchouch 2018, 37% of GCP and 51% of FIGS were resistant, while for Holetta-nursery 2018, these respective percentages were 46% and 53%, and 55% and 61% for Holetta-quarantine 2018. The highest percentages of susceptible (32% for GCP, and 44% FIGS), and highly susceptible accessions (10% for GCP,  www.nature.com/scientificreports/ and 5% for FIGS) were identified at Holetta-nursery 2017 for both subsets, while for all the other environments these classes of reaction ranged from 0 to 10%. When the disease scoring is grouped into three classes, (I + R, MR + MS, S + HS), the highest percentage of resistant accessions were found in the case of Guich Morocco-2018, and the lowest at Holetta-nursery 2017. For the moderately resistant and moderately susceptible classes, the GCP and FIGS subsets had similar percentages except at Guich-2018 where GCP (33%) had almost twice as much as FIGS (17%) (see Supplementary Fig. S1).

Comparison of GCP and FIGS subsets.
The tests of independence showed that the reaction to scald is dependent on the subsets when tested in the field at Guich and Marchouch in Morocco with χ 2 P-values < 0.01 (Table 1). However, this dependency was not observed when the subsets were tested against the scald natural populations in Ethiopia. The same conclusions were confirmed when different classes of disease reactions were combined (see Supplementary Table S1). In case of dependency, FIGS appeared to deviate from theoretical distribution and had higher percentages of accessions with resistant and moderately resistant reactions and lower percentages of susceptible and highly susceptible accessions. At the seedling stage, the reaction types to both isolates were dependent on the subsets for all classes of reactions (P values < 0.01), except for the isolate SC-511 for the grouping (I + R + MR) and (MS + S + HS) with a p-value of 0.201 (Table 2).  www.nature.com/scientificreports/ For the test of goodness of fit, highly significant differences between FIGS and GCP subsets were declared based on a comparison of different classes of disease reactions at Guich and Marchouch (2018) in Morocco with P values < 0.01. These differences were significant (P-value = 0.033) in case of the trial in Holetta-nursery 2018, but not in Holetta-nursery 2017 (see Supplementary Table S2). When the differences were significant, FIGS subset showed higher percentages of resistant classes and lower percentages of susceptible classes than GCP subset at the adult plant stage. When the tests were done at the seedling stage using both isolates, the differences were highly significant between FIGS and GCP subsets for all groups of disease reaction classes except for the isolate SC-511 for the grouping (I + R + MR) and (MS + S + HS) with a p-value of 0.091 ( Table 2).
Prediction of reaction to scald using machine learning approach. In our present study the performance of predictive models varied depending on the environment at the adult plant stage, and on the isolates at the seedling stage. At the seedling stage, the two modeled classes were not balanced for both high and averages scores ( Table 3). The isolate SC-1122 showed a better balance than isolate SC-511 between resistant and susceptible classes. High score exhibited a higher but non-significant accuracy than the average score for both isolates. However, models for average accuracy were significantly higher than Null accuracy with 0.89 and 0.64 for isolate SC-511 and SC-1122, respectively. Models for both isolates, and both scores were not able to identify efficiently resistant accessions due to low sensitivity with maximum value of 0.25.
At the adult plant stage for the combined field data, only two models showed significant and accurate results, BCART and RF with an accuracy of 0.73 and 0.72, respectively (see Supplementary Table S3). Specificity was very low-to-medium for all the four assessed models, and the Null accuracy was higher than 0.5 and reached to 0.67. All models efficiently identified resistant accessions with specificities higher than 0.88. Modeling of field data per location/environment (Table 4) showed different patterns. Accuracy from modeling Marchouch was significantly higher than Null model with an accuracy of 0.81 and Kappa of 0.53.

Isolates Set
Group of 6 classes Group of 2 classes Group of 3 classes χ 2 (P-value) χ 2 (P-value) χ 2 (P-value)  Table 3. Modeling performance from the best machine learning algorithm for seedling data for high (HS) and average (AS) disease scores for scald disease in barley set. BCART Bagged Carts, KNN K-nearest neighbors, SVM support vector machine. www.nature.com/scientificreports/

Discussion
The good development of scald under natural infection in Holetta-Ethiopia and under artificial inoculation with the use of overhead sprinklers irrigation system in Morocco allowed reliable screening of FIGS and GCP subsets. Over the two cropping seasons, scald appeared as a permanent threat to barley in Ethiopia while in Morocco its natural development was more sporadic in time and space. The surveys conducted in North African countries during 1990-1994 through UNDP-ICARDA project on cereals and food legume diseases showed that scald is a major disease of barley in Tunisia, but in Morocco, infections were observed only once out of five years (unpublished data). Epidemics were observed in 1998 in Eritrea, Ethiopia, Turkey, Tunisia, and Morocco 56 . During 2018 and 2019, the visits of farmers' fields in many regions of Morocco showed a low incidence of scald with fields showing medium severity including in mountainous and semi-arid regions (unpublished data). All released varieties in Morocco are moderately to highly susceptible to scald 57 , and host plant resistance was chosen as the most-appropriate approach for integrated management of the disease. Our results showed that resistance was found in both FIGS and GCP subsets at the seedling and adult plant stages. The reaction to the disease differed between isolates, seasons, field locations, and between subsets at both stages. The differences in reactions could be attributed to virulence differences among the isolates at the seedling stage with isolate SC-511 being more aggressive than isolate SC-1122, and to differences in virulence spectra among scald field populations at the adult plant stage (Fig. 1). A similar finding was reported by Albustan et al. 58 , where they found 31% resistant genotypes at the adult plant stage and 19% at the seedling stage.
Although sources of resistance to scald were identified in both subsets at both stages, in general FIGS subset presented higher percentages of resistant accessions (54%) at both stages compared to GCP subset (45%). Likewise, 26% accessions in a Spanish barley core collection 41 , and 13% landraces from a set of lines derived from Syrian and Jordanian landraces 43 were resistant to scald. Our study is the first attempt to compare FIGS to another subset, and the results are supporting the efficiency of this approach in mining genebank collections. Several other studies have shown the limitation of the core collection approach for identifying and searching for rare and adaptive alleles [59][60][61] . In addition, the core collection and the GCP reference set derived from it are targeting higher overall diversity of the accessions, while FIGS is targeting the diversity for resistance to scald. However, more research is needed to assess the efficiency of both GCP and FIGS of finding different effective genes for adaptive traits such as resistance to scald. Furthermore, genotyping and association mapping studies could decipher if resistance is conditioned by new resistance genes or by new allelic variants of the existing resistance genes.
The high likelihood of containing the targeted alleles in FIGS-subset is most probably due to the co-evolutionary process that occurs over time between the host and the pathogen, and which might allow the maintenance of a high number of effective avirulence genes in the pathogen populations. Several studies demonstrated that the environment has a great influence on the development of the disease as well as the genetic structure of both the host and pathogen through processes of natural selection, gene flow, and spatial/geographic differentiation 62,63 . The filtering approach of FIGS was based on environmental conditions best suited for the natural development of the disease and this approach might not be applicable to the favorable areas where improved varieties are used at a large scale, and which has affected the diversity of the pathogen populations by continuous selection of more virulent races. In addition, our results support to some extent the theoretical basis of FIGS approach and presents further evidence that the geographical distribution of the disease resistance in nature is not random, and it is influenced by environmental conditions and the host diversity. However, our FIGS modeling approach using machine learning did not show a strong relationship between resistance to scald and environmental condition for all isolates and field populations. This could be due to the limited size of the sets, and the unbalanced distribution of reaction to scald. In addition, this could also be a result of the variability within a collection site 43 , or a limited correlation between disease resistance and environmental conditions as Silvar et al. 41 reported that resistance was found across the country. However, more resistance was found in winter types in Spain and at higher altitudes of Ethiopia 42 . This geographic structure of disease resistance distribution was also evidenced for stem rust in wheat 64 , and powdery mildew in barley 65 . www.nature.com/scientificreports/ Several previous studies using FIGS approach in different crops supported the evidence of identification of desirable expression traits using eco-geographical data [53][54][55] . FIGS approach was successfully used to identify source of resistance to stem rust and powdery mildew in wheat 55,66,67 , and net-blotch in barley 52 . The superiority of FIGS filtering approach was not confirmed at Holetta-nursery 2017 in Ethiopia, showing the need for refining the sub-setting process through consideration of the virulence spectra of the pathogen populations. The filtering approach can also be refined or improved further by the search of the best predictive models that can explain the relationship between the resistance and agro-climatic conditions using machine learning.
Using single isolates at the seedling stage, and the use of natural and artificial infection under the field conditions, the results showed that the reactions of both FIGS and GCP subsets to R. commune population are highly variable. The instability of responses could be attributed to the variable nature of the R. commune pathogen in terms of pathogenicity, genetic makeup of barley genotypes grown in a region, and response to environmental factors. The major resistance gene will exert selection pressure on virulence allele frequency of the pathogen isolates which will eventually break down the deployed resistance within a short span 68 . Selection is one of the evolutionary factors which can be controlled to some extent by human interventions to slow down the evolution of genetic variants with altered pathogencity. Mert and Karakaya 69 found 7 barley cultivars resistant to all five R. commune isolates from Turkey. In another study, Azamparsa et al. 70 treated a barley cultivar (Tokak 157/37) with gamma irradiation and tested 25 advanced mutant lines with three virulent isolates of scald from Turkey. Though Tokak 157/37 was highly susceptible to three R. commune isolates, two mutant lines were resistant to at least two isolates tested. This approach is suitable to introduce resistance in popular barley cultivars which are difficult to replace. Similar findings of high variability of the pathogenicity of the R. commune were also reported across the world 26,33,71 . All studies suggest complex interaction of scald isolates with barley genotypes and determination of pathotype diversity is essential to develop effective screening protocols.
The observed variability in virulence in Rhynchosporium is probably due to the high genetic variation of the pathogen population as reported by several studies 72,73 . The use of molecular markers showed limited genetic diversity in scald isolates from West Asia and North Africa 74,75 , however, high genetic variation was reported from Australia, California, Finland, and Norway R. commune populations 76,77 . But a clear correlation between molecular markers and pathogenicity of scald populations is lacking which indicate the involvement of other molecular mechanisms which drive the evolution of virulence in R. commune. In general, virulence genes are located on chromosomal regions rich in transposable elements which are more prone to recombination/mutation providing an evolutionary advantage to the pathogen to come up with novel variations in its effector repertoire to evade host defenses 78 . The study of the evolution of necrosis-inducing protein (NIP1) effector from 146 strains of four Rhynchosporium spp. revealed its presence in all species, but presence/absence polymorphism in addition to copy number variation in R. commune revealed a strong selection pressure on this effector protein 79 . The deployment of major resistance genes in rotation along with prudent use of fungicides can reduce the selection of virulent isolates and enhance the longevity of deployed resistant cultivars 80 .
The Focused Identification of Germplasm Strategy (FIGS) based on filtering agro-climatic conditions favoring the development of the disease has shown its efficiency in identifying a higher frequency of resistant accessions compared to the Reference Set of the Generation Challenge Program (GCP). However, more refinement of this approach is needed to take into consideration the specter of virulence of the pathogen and the development of best-bet models linking resistance to environmental conditions using machine learning.

Methods
Germplasm used. A total of 292 genotypes from the ICARDA genebank were evaluated for scald resistance including 101 accessions selected using FIGS approach (FIGS subset), and 191 accessions from the Generation Challenge Program reference set (GCP).
The filtering approach was used to develop FIGS subset based on the following parameters: • Count number of days where the average daily temperature is between 8 and 12 °C, 10 days before the onset of the growing period and up to 15% into the vegetative phase. • Remove sites with zero count from step 1.
• Sum daily rain for 10 days before the onset of the growing period up to 10% into the vegetative phase.
• Normalize both variables (steps 1 and 2) to range 0-1 for each site.
• Add variables to create index 1.
• Rank based on index 1 and remove the bottom 25 percent of sites.
For the remaining sites, the following was done: • From 10% into the vegetative phase until onset of grain filling divide into 3 separate sub-phases of equal length. • For each sub-phase count the number of days where the average daily temperature is between 15 and 20 °C.
• For each sub-phase determine the amount of precipitation.
• Remove sites if any of the variables = 0 (3 count variables and 3 precipitation variables).
• Normalize each variable for a range between 0 and 1.
• Add each variable and then add index 1 to create index 2.
• Rank sites using index 2 from largest to smallest. www.nature.com/scientificreports/ Depending on the required subset size and the number of remaining candidate sites, proceed with the selection of accessions.
• If there are fewer candidate sites than the required set size, choose one accession in a cyclic manner working down the rank and then repeat starting at the top of the rank with sites that still have accessions after each pass. Alternatively, starting from the top ranked site, multiple accessions per site could be chosen until desired set size is achieved. • If there are more sites than the desired set size, then one accession could be chosen randomly from each site starting at the top ranked site until the desired set size is reached. Alternatively, this approach could be taken after one candidate accession is donated by each country represented in the candidate site list. Each entry was seeded in 1 m length row with a row spacing of 0.5 m in an augmented design. Two susceptible (Tocada and Tissa) and resistant (ICARDA4 and Atlas 46) checks were planted after every 10 accessions and each block was surrounded by a perpendicular border row, composed of a mixture of scald susceptible genotypes (Tiddas, Aglou, Tissa, Adrar, Shepherd, Fleet, Baudin and Alester), to maintain high disease pressure. Artificial field inoculations were performed four times starting from Zadoks scale 83 GS30 at an interval of 10-12 days apart in the evenings with inoculum composed of spore suspension composed of a mixture of 50 R. commune isolates collected from different agro-ecological zones of Morocco using a knapsack sprayer. In addition, scald infected barley residues were spread uniformly in the field. Following inoculations, the covering of spreader rows with a plastic sheet overnight and the frequent use of a mist system with overhead sprinklers ensured good disease establishment and uniform spread of the disease. In Ethiopia, the experiment was conducted in Holetta experimental station (9° 00′ N and 38° 30′ E) during 2017 and 2018 cropping seasons using the same design as described above. There, the environmental conditions were favorable to allow a high natural infection level of scald for efficient screening of the germplasm. The disease severity was assessed at GS 73-75 using 0-9 scale 84 . Genotypes were classified in the following categories: Resistant (0-2), moderately resistant (3-4), moderately susceptible (5-6), susceptible (7)(8), and highly susceptible (9). Data analysis. The statistical analysis was performed using R software 85 . The disease assessment data from the seedling and adult plant stage were tested for statistical association between the sub-setting approaches using χ 2 tests of independence with significance level (α = 0.05). The equation used to calculate chi-square is as follows: (1) where is the E ij = expected value, c k=1 O ij is the Sum of the ith column, r k=1 O kj is the Sum of the kth row, N is the total number.
The expected values for the test of goodness were calculated using following equation: where E i is the expected value, n is the total sample size, and p i is the hypothesized proportion of observations in level i. To find out the differences between FIGS and GCP subsets in terms of reaction to scald, the test of goodness of fit using χ 2 test at a significance level (α = 0.05) was used where GCP was simulated to a random sample. Both tests were done using different groupings of reactions, all six classes (I, R, MR, MS, S and HS), three classes (I + R, MR + MS, S + HS) and two classes (I + R + MR and MS + S + HS).
Modeling of the reaction to scald disease. The machine learning approach using the obtained SRT and APS disease screening data was used to find out the best models that link adaptive traits, environments (and associated selection pressures) with genebank accessions. We used environmental data from WorldClim1 databases as predictors. WordClim is an open access database providing global climatic layers describing past climatic profiles of collection sites and intended for spatial modeling or mapping.
It includes average, monthly minimum and maximum temperatures, precipitation, and bioclimatic variables 86 .
The following machine learning algorithms were used: K-nearest neighbors (KNN) 87 , Support Vector Machines (SVM) 88 , Random Forest (RF) 89 , and Bagged Carts (BCART) 90 . Each machine learning model was tuned to select best tuning parameters using a training set (70% of the total set) and then the best model was selected between different machine learning models based on several metrics including accuracy, specificity, and Kappa. The modeling metrics were computed on the test set (30% of the total set). In this study, R language and caret library were used for machine learning 91 . Models were tuned for parameter's optimization and trained on 70% of the data and tested with 10 cross validation folds and 100 replications. Modeling was done for the two isolates for the seedling stage using average and high scores separately. For the adult plant reaction, modeling was done for the combined multi-locations data sets and for each location separately.

Statements of compliance.
The corresponding author attests that: The plants were regenerated and handled using the standard operating procedures set by ICARDA genebank following the FAO genebank standards. All accessions used in this study were acquired from ICARDA genebank in accordance with the International Treaty on Plant Genetic Resources for Food and Agriculture.

Statements of consent.
The corresponding author attests that: The seeds and plants were planted and evaluated with the consent and mutual understanding between ICARDA genebank and the Ethiopian Institute of Agricultural Research (EIAR). Standard operating procedures set by ICARDA genebank and EIAR were followed. We did not collect any barley leaves for this study. All the barley accessions were obtained from ICARDA genebanks using Standard Material Transfer Agreement.

Data availability
The data that support the findings of this study are stored in the ICARDA genebank database and can be made available upon request from the corresponding author.  www.nature.com/scientificreports/