Modeling risk of Sclerotinia sclerotiorum-induced disease development on canola and dry bean using machine learning algorithms

Diseases caused by the fungus Sclerotinia sclerotiorum are managed mainly through fungicide applications in canola and dry bean. Accurate estimation of the risk of disease development on these crops could help farmers make spraying decisions. Five machine learning (ML) models were evaluated in classification and regression modes for predicting disease establishment under different air temperatures and leaf wetness duration conditions. Model algorithms were trained and tested using 20-fold cross validation. Correspondence between predicted and observed values were measured using Cohen’s Kappa (classification) and Lin’s concordance coefficients (regression). The artificial neural network (ANN) algorithms had average accuracies ≥ 89% (classification) and R2 ≥ 88% (regression) on canola and dry bean and their correspondence agreements were ≥ 0.83, which is considered substantial to almost perfect. In contrast, logistic regression algorithms had accuracies of 88% for dry bean and 78% for canola; other models were similarly inconsistent. Implementation of ANN models in disease warning systems could help farmers with spraying decisions. At the same time, these models provide insights on temperature and leaf wetness requirements for development of S. sclerotiorum diseases in these crops. Results of this study show the potential of ML models as tools for epidemiological studies on other pathosystems.


Results
Disease incidence. The variances of canola and dry bean trials were homogenous (P = 0.8348 and 0.7251, respectively) and therefore, a combined analysis within each crop was conducted. The analysis of variance for canola (Table 1) indicated the interaction between wetness duration and incubation temperature was not significant (P = 0.1768) but the main effects of both factors were significant (P < 0.0001). The optimum incubation temperature was 25 °C with an average incidence of 88%. Reducing the incubation temperature to 15 or 20 °C lowered the incidence by almost 20% while increasing it to 30 °C reduced incidence by 43% ( Table 2). The Table 1. Analysis of variance of the effect of incubation temperature and interrupted leaf wetness period and of interrupted leaf wetness on diseases incidence caused by Sclerotinia sclerotiorum ascosporic infections on canola and dry bean, respectively. Analysis was conducted using the GLIMMIX procedure of SAS (version 9.4). The studies were conducted for 10 and 8 days, respectively. www.nature.com/scientificreports/ analysis of variance for dry bean data (Table 1) indicated the interaction between the evaluated wet and dry periods was not significant (P = 0.2738) but the main effects of both factors were significant (P < 0.0001). Incidence increased significantly (α = 0.05) when the length of the wet incubation period increased to 16 h (Table 2) but increasing it from 8 to 12 h did not result in significant increases. Similarly, extending the dry incubation period from 18 to 24 h resulted in an approximately 30% reduction in incidence but extending it from 12 to 18 did not affect incidence (Table 2). Providing constant wet conditions, without dry periods, led to the second highest disease incidence at 78%.

Classification analyses.
For canola, the ANN and SVM models showed the highest accuracy and precision followed by RF (Table 3, Supplementary Table S1). When compared to LGR, the ANN and SVM models were 11 to 10% more accurate and 12 to 11% more precise, respectively. However, all three models had similar recall percentage which ranged between 91 and 92%. The ANN and SVM models had F-scores of 91% while LGR had a score of 85%. The RF model had greater accuracy, precision, and F-score than LGR but similar recall values. Figure 1 shows the probabilities of disease development predicted by ANN using temperature, leaf wetness, and total time from the inoculation as predictors. For dry bean, ANN performed consistently better than LGR by showing greater accuracy, precision, recall and F-score (Table 3, Supplementary Table S1). The SVM model was slightly superior to LGR; however, in contrast to Table 2. Main effects of discontinuous leaf wetness duration and incubation temperatures on incidence (%) of foliar lesions caused by Sclerotinia sclerotiorum ascosporic infection on canola and dry bean plants. On canola, a successive wet and dry period adds to a cycle of 24 h; in dry bean, the cycle does not necessarily add to 24 h. Incidence values are least square means that represent 24 and 34 observations on canola and dry bean plants, respectively. Incidence was measured after 8 and 10 days of incubation of canola and dry bean plants, respectively. Incidence means followed by same letters in a factor are not statistically different (α = 0.05) from each other according to the Tukey-Kramer test. A "-" indicates levels of the factor were not tested.  Table 3. Evaluation of fitness of artificial neural networks (ANN), support-vector machine (SVM), random forest (RF), decision trees (DT), and logistic regression (LGR) machine-learning models used in classification analyses of canola and dry bean data sets that associated incubation temperature and duration of leaf wetness conditions with incidence of Sclerotinia stem rot disease. AUC represents the area under the receiver operating characteristic curve.

Regression analyses.
In the canola study, the ANN algorithm was superior to all other models with higher R 2 , and smaller root mean square error (RMSE) and mean absolute error (MAE) values (Table 4). RF was identified as the second-best model followed by DT. The LNR and SVM models provided a significantly lower fit to the data, explaining 31 to 35% of the variation in disease incidence. Visual association between actual and predicted incidence values are represented on Table S3. ANN also was superior to all models in the dry bean data set with R 2 of 95% and smaller RMSE and MAE values ( Table 4, Supplementary Table S3). The LNR and SVM models fit the dry bean data much better than the canola data, but both were still outperformed by the DT and RF models. When model predictions were compared to actual outcomes, ANN was clearly superior to all other models with a moderate to substantial agreement between observed and predicted values 33 as indicated by average Lin's concordance coefficient of 0.94, 0.98 for canola and dry bean, respectively. The second-best models were RF (ccc of 0.87 and 0.95) and DT (ccc of 0.86 and 0.94) for canola and dry bean, respectively. LNR and SVM had lower coefficients (

Discussion
As pointed by Skelsey 34 , application of ML in agriculture has been overwhelmingly oriented towards recognition of images, whether it is of weeds, fruits, flowers, or of plant diseases [35][36][37][38] , with very few applications being made on the estimation of the risk of disease development. One of the first publications on the latter area was made by Kaundal 22 who compared the conventional multiple regression to the generalized regression neural networks, and SVM to predict rice blast. In their work, Kaundal et al. 22 observed that SVM produced a more accurate www.nature.com/scientificreports/ algorithm than the other methods. To the best of our knowledge, this is the first time that five ML models are compared for their efficacy to predict development of Sclerotinia-induced diseases on canola and dry beans. The flexibility and versatility of ML models were in evidence in the study reported in this manuscript. ANN models were more efficient than LGR when used either in classification or regression modes and in general provided substantial to almost perfect levels of correspondence between observed and predicted events 32,33 .
LGR is considered the tool of choice for development of predictive plant disease risk models [12][13][14][15][16] . The superiority of ANN over LGR was highlighted earlier by Dreiseitl and Ohno-Machado 39 who indicated that in 36 of 72 instances models developed using ANN outperformed those from LGR, while LGR models were superior to ANN models only in five instances. A similar result was observed by Paul and Munkvold 24 , who used it to model the association between gray leaf spot severity using multiple environmental and cultural factors. Further, Chakraborty et al. 20 reported the application of ANN in modeling the relationship between severity of anthracnose of Stylosanthes scabra and several weather variables; their results showed that the best ANN model had the accuracy of > 85%. In the present study, the fitness and accuracy of two other ML models, SVM and RF, were a close second in classification and regression mode analyses, respectively. SVM produced superior models than LGR for the canola set although the LGR models were better than SVM on the dry bean set. Using classification data, SVM produced models with 88% and 90% accuracy in canola and dry bean, respectively, while RF produced models with 85-86% accuracy in both data sets. The high performance of SVM and RF algorithms also have been reported by other researchers; for example, Mehra et al. 23 reported an RF algorithm with accuracies of 93% and R 2 of 79% that modeled Stagonospora nodorum blotch of winter wheat, while Wen et al. 25 reported another for soybean rust that explained 76 to 87% of the total variation in spore movement.
Environmental factors such as temperature, relative humidity, precipitation, wetness duration, and wind speed have a determinant role on development of plant diseases and those caused by S. sclerotiorum are no exception. In Table 4. Statistical fitness metrics of artificial neural networks (ANN), support-vector machine (SVM), random forest (RF), decision trees (DT), and linear regression (LNR) machine-learning models used in regression analyses of canola and dry bean data sets that associated incubation temperatures and duration of leaf wetness conditions to incidence of Sclerotinia stem rot disease.  Table 5. Concordance coefficients for classification (Kappa) and regression (Lin's ccc) models for correspondence between observed and predicted outcomes of artificial neural networks (ANN), supportvector machine (SVM), random forest (RF), decision trees (DT), logistic regression (LGR), and linear regression (LNR) machine-learning models used to characterize the effect of leaf wetness and incubation temperature on incidence of Sclerotinia stem rot of canola and dry bean. www.nature.com/scientificreports/ this study, the impact of interrupted leaf wetness was explored. The range of temperatures evaluated in this study could not be considered extreme since incidences at 10 °C and 30 °C ranged between 28 and 50%. It is likely that the lack of significant interactions between leaf wetness and incubation temperature or length of the dry period is due to the resiliency of the pathogen to desiccation once it is in plant tissues 40 . Multiple prediction models for Sclerotinia diseases have been developed using weather components; for example, Mila et al. 15 studied the role of air temperature and precipitation on the probability of SSR disease prevalence on soybean using LGR models. In their study, they used monthly means for both variables obtained from 320 weather stations distributed on four states and produced models with high explanatory powers that were like the ones reported for LGR models in our study. However, the predictive accuracy of ANN models was 11% and 4%, greater than the LGR models for canola and dry bean, respectively. Harikrishnan and del Río 14,40 conducted studies on white mold disease of dry bean under growth chamber and field conditions and produced models with high explanatory power that ranged between 65 and 91% accuracy. Clarkson et al. 41 studied the association between air temperature, relative humidity, and ascospore density on disease development on lettuce. The role of wind, air temperature and relative humidity on development of white mold on soybean was investigated by Willbur et al. 42 , who produced models with 81.8 to 87.9% accuracy. These tools can assist farmers with spraying decisions. To make these models available to farmers, they could be incorporated into disease-warning systems like the Sporecaster which became available to Michigan soybean growers in 2018 43 or the Sclerotinia risk map that is available to North Dakota farmers 44,45 .
The ANN model presented in this study accurately predicted disease development using environmental factors, e.g., wetness duration and/or temperature, and did it consistently in classification and regression modes, and for two different crops, canola, and dry bean. This flexibility is an indication that ML techniques could be used to model other stages of the disease progress such as apothecial development, spore dispersal, and infection process. The models developed for each stage could then be merged to generate a comprehensive forecasting system. However, model validation prior to full implementation should be conducted under different environmental conditions and locations to ensure its reliability.
In summary, results of this study highlight the potential of ML methods for the development of models that evaluate risk of plant disease development. ANN could predict disease development with high accuracy in classification and regression analyses on both crops, whereas the accuracies of other models, including LGR, were affected by the crop, the type of analysis, and the predictors used. Nevertheless, other ML techniques, e.g., RF, have produced excellent models in other pathosystems 21,23 . Thus, it could be said that more modern ML techniques albeit more complex than LGR, may be described as the "next generation" tools for modeling the risk of plant disease development.

Materials and methods
Data collection. Seed samples from the canola and dry bean cultivars used in this study were obtained from the respective NDSU breeding programs. Westar is an open-pollinated canola cultivar that was released in 1987 in Canada 46 . Westar is no longer available for commercial production but is routinely used as susceptible control in S. sclerotiorum trials in our program. 'Maverick' is a dry bean cultivar released by the NDSU dry bean breeding program in 1997 47 . Maverick is still commercially available and because of its susceptibility to S. sclerotiorum, it also is used as susceptible control in our trials. Since these materials were/are commercially available, voucher specimens were not deposited in publicly available herbaria. Use of these cultivars for research purposes is neither restricted nor regulated in any form by relevant institutional, national, and international guidelines and legislation.
Canola. Detailed process of inoculation and data collection is described by Shahoveisi and del Río Mendoza 16 . Briefly, canola flowers were collected from Sclerotinia stem rot-susceptible cultivar, Westar. Flowers were inoculated with dry ascospores of S. sclerotiorum isolate WM031 by placing them in the upper two layers of an Andersen spore sampler 48 and then activating the sampler above groups of mature lab-produced apothecia. A total of ten inoculated flowers were placed on leaves of a canola plant per replication. Inoculated plants were incubated in different combinations of alternating wet and dry conditions in a period of 24 h (i.e., 6/18, 10/14, 14/10, and 18/6 of wet/dry hours). Experimental units were arranged using a randomized complete block design with three replications and the entire study was conducted twice. Plants were placed in closed plastic bags to maintain the leaf wetness for the required period and then opened to allow drying of the plants. At the end of the dry incubation period, plants were sprayed with distilled water and the plastic bags were closed. Disease incidence, expressed as the percentage of inoculated flowers that formed leaf lesions, was recorded at the end of each wet cycle starting 24 h after plant inoculation until 10 days post inoculation.
Dry bean. Dry bean flowers, cv. Maverick, were collected and inoculated with isolate WM030. The same inoculation method described for canola was used in this experiment. Five flowers were placed on the primary leaves of a dry bean seedling at the V-1 stage. These plants were subjected to alternating wet and dry incubation conditions but a wet and dry period did not necessarily add to a 24 h cycle (i.e., 8/12, 8/18, 8/24, 12/12, 12/18, 12/24, 16/12, 16/18, 16/24, and 24/0 of wet/dry hours). Each treatment was replicated four to five times in a completely randomized design with a single plant per replication. The study was conducted four times. Moist chambers set at 20 °C were used for wet incubation periods. Plants were transferred into a room set at 18 °C with relative humidity at 30 ± 5% at the end of each wet period. Starting the wet condition, leaves were sprayed with water until runoff. Disease incidence was recorded each time that plants were returned to moist conditions until 8 days post inoculation. www.nature.com/scientificreports/ Data analyses. Disease incidence. Maximum disease incidences, measured at the end of each study on both crops, were used to estimate the mean incidence for each treatment. Since the study conducted on dry bean plants was not a full factorial, the continuous wetness treatment was not included in the analysis of variance for incidence. The canola study was a full factorial. Homogeneity of variances of trials in each study was tested using Levene's test at α = 0.05. Then combined analyses of variances were conducted using the GLIMMIX procedure of SAS software (version 9.4; SAS Institute, Cary, NC) where treatments were considered fixed effects and trials, replications, and their interactions with treatments were considered random effects. Tukey-Kramer mean separation test at α = 0.05 was conducted to compare the least square means of treatments.
Classification analyses. All statistical analyses were conducted using Orange software suite (version 3.24.0; University of Ljubljana, Slovenia). Supplementary Figure S1 represents the summary of the workflow system used by Orange 3.24.0 software for classification and regression analyses. For the canola study, temperature, wetness duration, and accumulated wetness time from inoculation were considered independent variables. These predictors were selected because the interaction of interrupted wetness duration and incubation temperature has been studied to a lesser extent 16 . Average disease incidences recorded over the period of the study across all trials and replications for each treatment were considered as the dependent variable (N = 131). For the dry bean study, wetness duration, dry period, and accumulated time were used as the independent variable and average disease incidence was the response variable (N = 60). Five supervised-learning classification models, ANN, RF, DT, SVM, and LGR were evaluated for their ability to model disease development. The training and testing of the models were conducted using 20-fold cross validation. Binomial datasets of the dependent variable were generated where incidence values less than or equal 20% were labeled as 0 to indicate no disease development while incidence values greater than 20% were labeled as 1 to account for disease development. This threshold was selected because incidences below this level do not reduce yields significantly 14 . For the ANN model, one hidden layer with ten and five neurons and hyperbolic tangent, tanh, activation function, that adds non-linear property to the function, were used on canola and dry bean data sets, respectively. The numbers for maximum iterations and learning rate (α) were set at 2000 and 0.5 for canola and 2000 and 0.7 for dry bean, respectively. For the DT model, no limits were set to the tree depth and to the minimum number of instances in leaves (pruning). The node splitting was stopped after a 95% majority threshold was reached. For the RF model, 5 and 16 trees for canola and dry bean datasets were used, respectively. For the SVM model, radial basis function (RBF) kernel with numerical tolerance of 0.0001 and unlimited iteration were used for both crops. Minimum misclassification rate was obtained when the coefficient for the loss function (C) was set at 110 and 0.8 for canola and dry bean, respectively (this difference was due to sample size and data characterization). The epsilon (ε) value of 1 was used for canola while this threshold was 0.9 for dry bean dataset.
LGR uses a logistic function to classify the data into binary values. Using one to several predictors, probabilities of an event, such as disease development, is calculated by this algorithm 16,49 . Two parameters of LGR algorithms are regularization function and cost strength that is the inverse of regularization parameter (λ). In both studies, regularization function was obtained using Ridge regression (L2) with cost strength of 5 and 50 for canola and dry bean, respectively. Table 6 summarizes the type of analysis and parameter estimates used for development of each model.

Regression analysis.
Fitness of ANN, RF, DT, SVM, and LNR algorithms to the canola and dry bean datasets was evaluated. In both studies, average disease incidence across all trials and replications were used as the outcome. The training and testing were conducted using 20-fold cross validation. The performance of the algorithms was compared using their R 2 , RMSE, and MAE. R 2 represents the percentage of the variation in the dependent variable that is explained by the predictors, RMSE indicates the model's average error in prediction of the response variable for an observation, and similarly MAE corresponds to the prediction error of the model. ANN. In the canola analysis, a hidden layer with 200 neurons were used and the activation function was set to tanh. Also, the maximum iterations were set to 2000 and the learning rate to 1. Learning rates closer to 1 result in a more radical weight modification. For dry bean, model parameters were as following: two hidden layers each with 20 neurons, logistic activation function, learning rate of 1, and maximum iteration value of 2000 (Table 6).
DT. Similar to the classification analysis, in both canola and dry bean regression studies, data splitting continued until the majority threshold reached 95% level. In addition, we held no limits on the maximal tree depth as well as the minimum number of splits in leaves.
RF. For canola and dry bean studies, 10 trees were used; and maximal number of considered features and maximal tree depth were set as "unlimited" (Table 6).
SVM. This algorithm was run using linear kernel function, with numerical tolerance set to 0.0001, and unlimited iterations. Minimum misclassification rate was obtained when penalty rate parameter was set at 1. A threshold of 0.90 was used for ε parameter in both studies (Table 6). LNR. This model is the simplest ML algorithm that describes the linear relationship between the input (independent variable) and the output (dependent variable) 50 . In disease forecasting studies, multiple LNR models allow analyzing the regression between multiple predictors and the response variable. In both canola and dry bean analyses, the regularization parameter (α) was set at 1 ( www.nature.com/scientificreports/ Model comparisons. To estimate the overall fitness of each model, confusion matrices of the models were obtained and metrics including area under the receiver characteristic curve (AUC), classification accuracy (CA), precision, recall, and F1 score were calculated. AUC indicates the ability of the model to distinguish the classes. A higher AUC represents a more accurate model in separating 0 and 1 classes. CA is the ratio of correct predictions (true positive and true negative) to the total number of predictions. Precision is calculated by dividing "true positive" by "total predicted positive" whereas recall is ratio of "true positive" to "total actual positive". The recall proportion reflects the percentage of instances where the model correctly identified a case [true positive/(true positive + false negative)]. The F-score provides a harmonic mean between precision and recall and in a way, describes the overall accuracy of a model. The F1 score is calculated by the following formula: When expressed as percentage, the closest the precision, accuracy, recall, F1 score, or AUC, is to 100, the better the model is. To further visualize the relationship between actual and predicted incidences from each model, plots were produced and added as Supplemental Materials.
Model validation. To evaluate the predictive ability of each model, a bootstrapping procedure with replacement was implemented. The levels of correspondence between predicted and actual values for each model were estimated using the Kappa statistic 51 for classification data and Lin's concordance correlation coefficient 52 for regression models. The Lin's ccc and Kappa statistic were calculated for each crop separately. . Table 6. Parameter estimates of artificial neural networks (ANN), decision trees (DT), random forest (RF), support-vector machine (SVM), logistic regression (LGR), and linear regression (LNR) machine-learning models used in classification and regression analyses.
LGR was used in classification and LNR in regression analyses.