Abstract
Sugarcane is an economically important crop, but its genomic complexity has hindered advances in molecular approaches for genetic breeding. New cultivars are released based on the identification of interesting traits, and for sugarcane, brown rust resistance is a desirable characteristic due to the large economic impact of the disease. Although marker-assisted selection for rust resistance has been successful, the genes involved are still unknown, and the associated regions vary among cultivars, thus restricting methodological generalization. We used genotyping by sequencing of full-sib progeny to relate genomic regions with brown rust phenotypes. We established a pipeline to identify reliable SNPs in complex polyploid data, which were used for phenotypic prediction via machine learning. We identified 14,540 SNPs, which led to a mean prediction accuracy of 50% when using different models. We also tested feature selection algorithms to increase predictive accuracy, resulting in a reduced dataset with more explanatory power for rust phenotypes. As a result of this approach, we achieved an accuracy of up to 95% with a dataset of 131 SNPs related to brown rust QTL regions and auxiliary genes. Therefore, our novel strategy has the potential to assist studies of the genomic organization of brown rust resistance in sugarcane.
Introduction
Sugarcane is an important source of income worldwide, especially due to its efficiency in the manufacturing of biofuel and sugar-related products in most tropical and subtropical areas of the world1,2. Although this crop has great energetic potential, its breeding process has generated high genomic complexity across bred varieties, exceeding that of most if not all other crops3. Modern sugarcane cultivars are derived from a process of hybridization that has occurred over a century between Saccharum spontaneum (\(2n=5x=40\) to \(16x=128; x=8\))3,4 and Saccharum officinarum (\(2n=8x=80\), \(x=10\))3,4. S. officinarum has a more efficient process of sugar production but is susceptible to several biotic and abiotic stresses, in contrast to S. spontaneum, which has a low sucrose content but is resistant to different types of stress1,3,5. Sugarcane cultivars have unique chromosome sets (with numbers ranging from 80 to 130)6 with highly complex genomic organization1, a polyploid genome (with overall ploidy estimated to be between 6 and 14)7, a frequent occurrence of aneuploidy at the locus level depending on the number of homologous chromosomes in hybrid cultivars8, an estimated whole-genome size of 10 Gb9, and a high content of repetitive regions (50% of genome size)10. This complexity has challenged the efforts of the scientific community to unravel the genetic architecture of sugarcane in terms of the molecular mechanisms underlying different phenotypes, particularly efforts to detect regions of phenotype–genotype associations.
Sugarcane breeding programs are implemented with the intention of releasing new cultivars with interesting agronomic traits, including disease resistance11. One disease with a large impact on sugarcane yield is brown rust, which is caused by Puccinia melanocephala, a fungus that affects foliage and decreases the photosynthetic capacity of sugarcane12,13. Brown rust infections have already caused large economic losses14,15,16. However, disease control has been shown to be successful in sugarcane breeding17, and the planting of cultivars resistant to brown rust is considered the most effective method of controlling this pathogen11,12. Based on comparisons of the genetic characteristics of the resistant cultivar R570 and other sugarcane varieties17, brown rust resistance was found to be a dominant trait controlled by one or a few genes11,18, with the presence of two related major genes: Bru119 and Bru220. Bru1 has already been employed in different breeding programs to identify resistant sugarcane genotypes5, using, for instance, the presence of flanking molecular markers for resistance diagnosis across cultivars12.
Although there have been several advances in understanding brown rust susceptibility in sugarcane, it is important to consider that pathogens may overcome the resistance of sugarcane varieties, and the use of a single region for resistance examination further increases the probability of vulnerability5. Therefore, the exploration of novel genes could contribute to the understanding of this process and in turn overcome the problems associated with reliance on a single gene21. An appropriate strategy for unraveling the genetic architecture and genomic organization of brown rust resistance would be the use of linkage maps followed by quantitative trait locus (QTL) identification. However, existing methodologies for the construction of saturated linkage maps with high resolution are limited for aneuploid species, such as sugarcane5,22,23. Using simplification strategies based on the population expected segregation ratio, such as the selection of a subset of single-dose markers, leads to impaired linkage groups and thus compromises the identification of reliable QTLs22. A linkage map depicting QTLs associated with brown rust resistance has been published5, but as observed in previous studies11,24, adjustments of existing methods resulted in gaps, a poorly saturated map and a large number of unlinked markers, mainly due to the high probability of meiotic behaviors in the cultivars and the aneuploidy of sugarcane7,22. Different software programs have been developed to build linkage maps for polyploids22,25,26,27; however, none address sugarcane genomic organization.
The use of Bru1 for marker-assisted selection (MAS) represents a successful application of this methodology in some sugarcane varieties28. However, resistance differs among cultivars, which can restrict the application of validated linked markers as a general tool for MAS21,28. Therefore, the identification and characterization of brown rust resistance genes in sugarcane have been slow14, mainly because selection approaches based on QTL mapping overestimate the effect of strong QTLs, while weak QTLs might not be identified29,30. In general, these methodologies have low power to detect rare variants with phenotypic associations31. Methodologies for addressing sugarcane genomic characteristics are still lacking, and because of the difficulty of accurately selecting QTL regions for MAS, an alternative methodology known as genomic selection (GS) has been developed to identify promising varieties with resistance traits and improve sugarcane breeding programs in terms of time and cost32, 33.
In general, GS is based on the creation of a predictive model for breeding values built with the entire set of markers using a training and a testing population. This model can subsequently be applied in a breeding program to select a set of promising individuals33. In sugarcane breeding programs, the selection of superior genotypes might take more than 12 years34, and GS represents an alternative for improving this process, accelerating the breeding cycle and reducing the time needed to generate diversity31,33,35. Due to sugarcane’s genomic complexity, simplified predictive models involving linear regression cannot capture the unknown nonlinear characteristics present in these datasets31, as described for other polyploid species36,37,38. To address this issue, machine learning (ML) methodologies represent a promising approach with high accuracy31,39,40,41. Although GS was developed to address the problem of categorizing individuals using different populations, its application in biparental populations is suitable and might be highly efficient due to the significant amount of linkage disequilibrium between loci42, which would facilitate the initial cycles of breeding programs.
In sugarcane, the allele dosages (ADs) of a locus are frequently unknown7, which might lead to misclassified genotypes. These difficulties in genotyping a population directly impact the estimation of locus effects on model creation43, and this influence is more complex when using nonlinear models with more parameters to be estimated44. An alternative for dealing with erroneous features and additional restrictions for high-dimensional data is feature selection (FS). These techniques aim to reduce the number of single nucleotide polymorphisms (SNPs) in a data set and identify a subset of markers with higher predictive capability by removing markers that are irrelevant/redundant for the phenotype45. These methods are among the most powerful alternatives for building better generalization models46 while avoiding overfitting and the attribution of nongenetic effects to different markers43. With FS, it is possible to reduce marker density and build simpler and more comprehensive models46, thereby increasing predictive power due to the identification of phenotype-associated polymorphisms. A few previous studies applied ML methods to decrease the number of SNP datasets needed for phenotypic predictions47,48,49, achieving high accuracy. The identification of such a subset of putative causal polymorphisms is crucial for improving production in plants42 and represents a novel strategy for genomic prediction in sugarcane.
Therefore, the objectives of this research were as follows: (1) genotyping a sugarcane full-sib population using a genotyping by sequencing (GBS) protocol50 followed by an established bioinformatics pipeline to identify reliable SNPs considering the sugarcane aneuploid condition; (2) creating an ML-based strategy to establish a subset of SNPs with good ability to predict brown rust phenotypes; and (3) examining these polymorphic regions to identify genes and QTL regions. Our study provides a novel methodology that can assist in sugarcane genetic studies and breeding programs to establish a pipeline to infer phenotype-causative regions, which can help unravel sugarcane brown rust resistance molecular mechanisms and identify targets for breeding.
Material and methods
Mapping population and phenotypic characterization
A set of full-sib progeny composed of 219 individuals derived from a biparental cross between the elite clone IACSP95-3018 (female parent) and the commercial variety IACSP93-3046 (male parent) was developed by the Sugarcane Breeding Program at the Agronomic Institute of Campinas (IAC). IACSP95-3018 is a promising clone that is used in breeding programs but is susceptible to brown rust. IACSP93-3046 is a variety with good tillering, an erect stool habit and resistance to brown rust. These parents have already been used in transcriptome51 and mapping studies24,52.
The progeny phenotyped for brown rust symptoms were planted in 2005 at the Sugarcane Breeding Center of the Instituto Agronômico (IAC) located in Ribeirão Preto-SP, Brazil, and again in 2011 in Piracicaba-SP, Brazil, in an augmented block design with five blocks, each containing 44 individuals, plots with 1-m rows and plants spaced 1.5 m apart. Both parents and two varieties (SP81-3250 and RB835486) were included in each replicate as controls. The level of brown rust infection was evaluated using a diagrammatic scale between 1 and 9, with larger values indicating larger percentages of leaf area infection53. In Ribeirão Preto, four evaluations were performed: (1) November 2005 (plant cane), (2) January 2006 (plant cane), (3) January 2007 (ratoon cane), and (4) March 2007 (ratoon cane). In Piracicaba, the evaluations were conducted in December 2011 (plant cane) and in February 2012 (plant cane).
Phenotypic data analyses
The phenotypic analyses of brown rust were performed using R statistical software54 following a statistical mixed model:
\(Y_{ijrkm} = \mu + L_k + H_m + LH_{km} + B_{j(km)} + G_i + e_{ijrkm}\)
where \(Y_{ijrkm}\) is the phenotype of the ith genotype, considering the jth block, the rth replicate, the kth location and the mth year of harvest. The trait mean is represented by \(\mu\); the fixed effects were modeled to estimate the contributions of (1) the kth location (\(L_k\)), (2) the mth harvest (\(H_m\)), (3) the jth block at the kth location and in the mth harvest (\(B_{j(km)}\)), and (4) the interaction between the kth location and mth harvest (\(LH_{km}\)). The random effects included genotype G and the residual error e, representing nongenetic effects.
The residual distribution was evaluated using quantile–quantile (Q-Q) plots together with a Shapiro–Wilk normality test (p-value < 0.05). We also tested normalized values of the brown rust trait created with the R package bestNormalize55. To analyze the contribution of genotype to phenotype, we used best linear unbiased predictions (BLUPs) calculated based on the mixed model described above using the R package breedR56 v.0.12. Heritability (\(H^2\)) was estimated as \(H^2=\sigma ^2_g/\sigma ^2_p\), where \(\sigma ^2_g\) is the genetic variance and \(\sigma ^2_p\) is the phenotypic variance (the sum of the genetic, environmental and residual variances).
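As a minimal worked illustration of the heritability ratio defined above, the following Python sketch computes \(H^2\) from three variance components. The function name and the numeric values are hypothetical and chosen only to show the arithmetic; they are not the estimates obtained in this study.

```python
def broad_sense_heritability(var_genetic, var_environmental, var_residual):
    """Return H^2 = sigma^2_g / sigma^2_p.

    sigma^2_p is taken as the sum of the genetic, environmental and
    residual variance components, as stated in the text.
    """
    var_phenotypic = var_genetic + var_environmental + var_residual
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_genetic / var_phenotypic


# Hypothetical variance components, for illustration only.
h2 = broad_sense_heritability(var_genetic=3.0,
                              var_environmental=0.5,
                              var_residual=1.5)
print(round(h2, 2))
```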
Cluster analysis was then performed on the BLUP values. We used complete hierarchical clustering based on pairwise Euclidean distances for visual inspection. The appropriate number of clusters was identified using the K-means algorithm together with (1) the within-cluster sums of squares and (2) the average silhouette width of clusters, implemented in the R package factoextra57 v.1.0.6. To evaluate the differences among the phenotypic rust groups, we used t-tests of the BLUPs and original values.
Library preparation and sequencing methodology
Total genomic DNA samples from parents and 180 progeny were extracted from leaf roll using the CTAB protocol58. Genome complexity was reduced via the PstI restriction enzyme for library preparation50. We constructed two 9648-plex libraries from the population consisting of a single sample of each individual, two replicate samples of each parent and one blank sample. Five sequencing runs were performed with the Illumina GAIIx (one in 2015) and Illumina NextSeq (four in 2017 divided into two groups, which were sequenced twice) systems.
Quality filtering and demultiplexing
PhiX sequences were removed from GBS reads through alignments of raw reads against the PhiX genome using BLASTn59. Reads resulting in a minimum percent identity of 90% and e-value of 0.01 against PhiX regions were filtered out60. FASTQC61 was used for the initial visualization of nucleotide distributions and their respective qualities, and FastX-Toolkit scripts62 were employed to obtain 90-bp reads with a minimum of 80% of bases with a Q greater than 20. Sample demultiplexing was also performed using the FastX-Toolkit62.
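The read-level quality rule described above (trim to 90 bp, keep reads with at least 80% of bases above Q20) can be sketched in plain Python. This is not the FastX-Toolkit implementation; it assumes Phred+33 quality encoding, follows the text's wording of "greater than 20" as a strict inequality, and uses hypothetical function names.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_and_check(sequence, quality_string, length=90, min_q=20,
                   min_fraction=0.80):
    """Trim a read to `length` bases and apply the quality rule.

    Returns (trimmed_sequence, trimmed_quality), or None when the read is
    shorter than `length` or fewer than `min_fraction` of the trimmed
    bases have a quality score greater than `min_q`.
    """
    if len(sequence) < length:
        return None
    seq, qual = sequence[:length], quality_string[:length]
    good = sum(1 for q in phred33_scores(qual) if q > min_q)
    if good / length < min_fraction:
        return None
    return seq, qual


# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
kept_read = trim_and_check("A" * 100, "I" * 80 + "#" * 20)
dropped_read = trim_and_check("A" * 100, "I" * 60 + "#" * 40)
```

In this toy input, the first read keeps 80 high-quality bases among its first 90 and passes, while the second keeps only 60 and is discarded.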
Read alignment and reference evaluation
We used the BWA-MEM63 version 0.7.12 and Bowtie264 version 2.3.3.1 algorithms to align the filtered reads against the following references: (1) the methyl-filtered (MF) genome65 of sugarcane cultivar SP70-1143, (2) the sorghum genome from Phytozome66 v.13, (3) a sugarcane leaf transcriptome51, (4) the draft genome67 of the hybrid SP80-3280, (5) the monoploid genome of the R570 variety3, (6) the S. spontaneum genome split into four subsets on the basis of the allele-defined genome, and (7) sequences from the sugarcane expressed sequence tag project (SUCEST)68.
The performance of each mapping software tool was evaluated according to the percentage of uniquely mapped reads. The individual number of uniquely mapped reads across the population was also analysed and assessed using an alluvial plot for visualization. Individuals with small numbers of reads and high discrepancies with most others (at least a decrease of 70% in the number of reads) were not considered for further analysis. In order to identify the most appropriate reference for SNP calling in sugarcane, we examined the following aspects together: (1) the quantity of uniquely mapped reads; (2) the profiles of sequencing depth across loci; (3) the contiguity of consensus sequences obtained through the read alignments; (4) the capability to comprise the largest amount of consensus sequences formed by the other references; and (5) the number of SNPs identified.
For (1), we used the results from the most promising mapping tool. SAMtools69 version 1.6 was used to obtain the profiles of sequencing depth across loci in order to examine (2). The consensus sequences (contigs) formed by mapping reads to the different references were retrieved using Stacks70 version 2.3, and, for aspect (3), we calculated traditional assembly metrics (number of contigs, largest contig, total length, quantity of ambiguous bases (Ns) per 100 kbp, N50/N75 and L50/L75) for all the contigs separated by reference using QUAST71 version 5.0.2. An evaluation of the raw reference sequences was also performed using QUAST version 5.0.2. Additionally, for (4), all contigs were evaluated on the basis of their similarity to the consensus sequences of the other possible references. All correspondences were counted, taking into account the quantity of related contigs. These alignments were obtained through BLASTn59 with stringent parameters to examine real redundancies (a minimum e-value of 1e−30, a minimum percent identity of 95% and coverage of at least 75% in the query sequence). We also used the R package circlize72 to visually inspect these redundancies.
SNP calling and ploidy evaluation
For the last step of reference comparison, we executed the Tassel4-POLY73,74 and Stacks70 version 2.3 pipelines. We evaluated raw SNPs by comparing them to a dataset with a filter criterion of a maximum of 25% of missing data per locus, considering individual genotypes without a minimum count of 50 reads as missing data.
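The missing-data criterion can be sketched as a small Python filter. The data layout (a dictionary from locus identifier to per-individual read counts) and the function name are assumptions made for illustration; the actual filtering in the study was applied to the output of the SNP callers.

```python
def filter_loci(read_counts, min_reads=50, max_missing=0.25):
    """Keep loci whose fraction of missing genotypes is at most `max_missing`.

    `read_counts` maps a locus id to a list of per-individual total read
    counts. A genotype supported by fewer than `min_reads` reads is
    treated as missing and replaced by None in the output.
    """
    kept = {}
    for locus, counts in read_counts.items():
        missing = sum(1 for c in counts if c < min_reads)
        if missing / len(counts) <= max_missing:
            kept[locus] = [c if c >= min_reads else None for c in counts]
    return kept


# Hypothetical read counts for two loci across four individuals.
toy_counts = {
    "locus_a": [120, 60, 10, 75],   # 1 of 4 genotypes missing (25%)
    "locus_b": [30, 20, 200, 90],   # 2 of 4 genotypes missing (50%)
}
filtered = filter_loci(toy_counts)
```

Here `locus_a` sits exactly at the 25% limit and is retained with one masked genotype, while `locus_b` is discarded.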
Together with the Tassel and Stacks results, the best selected reference was used to identify variants through the Haplotype Caller algorithm implemented in Genome Analysis Toolkit (GATK)75 version 3.7, SAMtools69 version 1.6 and FreeBayes76 version 1.1.0-3. We created a common dataset to be processed by these tools, establishing a pre-processing pipeline according to GATK best practices75. From the mapping results, uniquely mapped reads were selected using SAMtools69 version 1.6, and with Picard Toolkit77, the following steps were performed: (1) the mapped files from different sequencing experiments were joined into one file per individual; (2) read duplicates were marked; and (3) read group information was added to different files. To produce more accurate results, we used GATK75 version 3.7 to realign indels and SAMtools69 version 1.6 to convert mapping formats. Putative SNPs were called using the three different tools and different ploidy configurations with GATK and FreeBayes (even ploidies ranging from 2 to 20). We selected the identified SNPs and evaluated these variants with respect to the quantity of missing data.
Final SNP-set selection and ploidy evaluation
Using the R package VennDiagram78 v.1.6.20, a Venn diagram was created to evaluate the intersection between SNPs identified by the callers and those identified by the selected reference. Indels were not used for further analyses. Due to sugarcane aneuploidy at the locus level, we genotyped the individuals on the basis of SNP allele proportions, i.e., the ratio between the number of reads for the reference allele and the total number of reads. To increase the reliability of our results, we selected markers called by Tassel and at least one other caller with a minimum count of 50 reads per individual and a maximum of 25% missing data.
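The two ideas in this paragraph, keeping only SNPs supported by Tassel plus at least one other caller and genotyping by allele proportion, can be sketched as follows. Representing a SNP as a (scaffold, position) tuple, the caller names used as dictionary keys, and both function names are illustrative assumptions.

```python
def consensus_snps(tassel, other_callers):
    """Return SNPs called by Tassel and by at least one other caller.

    `tassel` is a set of (scaffold, position) tuples; `other_callers`
    maps a caller name to its own set of such tuples.
    """
    supported_elsewhere = set().union(*other_callers.values())
    return tassel & supported_elsewhere


def allele_proportion(ref_reads, alt_reads):
    """Reference-allele reads divided by total reads; None without reads."""
    total = ref_reads + alt_reads
    return ref_reads / total if total > 0 else None


# Hypothetical calls, for illustration only.
tassel_calls = {("scaffold_1", 10), ("scaffold_1", 55), ("scaffold_2", 7)}
other_calls = {
    "gatk": {("scaffold_1", 10)},
    "freebayes": {("scaffold_2", 7), ("scaffold_3", 1)},
}
selected = consensus_snps(tassel_calls, other_calls)
proportion = allele_proportion(ref_reads=75, alt_reads=25)
```

In this toy example the SNP at `("scaffold_1", 55)` is seen only by Tassel and is therefore excluded, mirroring the removal of variants uniquely identified by a single tool.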
SuperMASSA79 and the VCF2SM pipeline73 were used to estimate the ploidy levels at different loci. Quantitative allele intensities at each locus were estimated for individuals based on read depth73. These values were used to estimate locus ploidies (ranging from 2 to 20). We used the F1 model for population structure due to the usage of a biparental population and did not restrict the posterior probability threshold to capture and analyze all possible configurations produced by the statistical estimate. We also defined the most probable set of loci with a posterior probability greater than 0.8 given the selected ploidy (6 through 14). We compared through treemaps the ploidies estimated for the final dataset of SNPs and for the possible false positives eliminated using the proposed approach.
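The selection of the most probable loci can be expressed as a simple filter over per-locus estimates. The sketch below does not call SuperMASSA; it assumes the ploidy and posterior probability for each locus have already been estimated and stored in a dictionary, and the function name is hypothetical.

```python
def reliable_dosage_calls(estimates, min_ploidy=6, max_ploidy=14,
                          min_posterior=0.8):
    """Keep loci with ploidy in range and posterior above the cutoff.

    `estimates` maps a locus id to a (ploidy, posterior_probability) pair.
    The strict inequality follows the wording "greater than 0.8".
    """
    return {
        locus: (ploidy, posterior)
        for locus, (ploidy, posterior) in estimates.items()
        if min_ploidy <= ploidy <= max_ploidy and posterior > min_posterior
    }


# Hypothetical per-locus estimates, for illustration only.
toy_estimates = {
    "snp_1": (8, 0.95),
    "snp_2": (20, 0.99),   # ploidy outside the 6 to 14 range
    "snp_3": (10, 0.60),   # posterior too low
    "snp_4": (12, 0.81),
}
reliable = reliable_dosage_calls(toy_estimates)
```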
Machine learning strategies
Using the identified SNPs as ADs and allele proportions (APs), eight ML algorithms were tested to check their ability to predict the phenotypic rust groups. Missing data were imputed as the means. We tested K-nearest neighbor (KNN)80, support vector machine (SVM)81, Gaussian process (GP)82, decision tree (DT)83, random forest (RF)84, multilayer perceptron (MLP) neural network85, adaptive boosting (AB)86, and Gaussian naive Bayes (GNB)87 implemented in the scikit-learn v.0.19.0 Python v.3 module88. As a cross-validation strategy, we used a stratified K-fold (k = 4) repeated 100 times for different data configurations. We evaluated the following metrics: (1) accuracy (proportion of correctly classified items), (2) recall/sensitivity (items correctly classified as positive among the total quantity of positives), (3) precision (items correctly classified as positive among the total items identified as positive), and (4) specificity (items correctly classified as negative among the total negative items). The area under the receiver operating characteristic (ROC) curve (AUC) was also calculated for each model and plotted using the Matplotlib v.2.0.2 library89 with Python v.3.
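The cross-validation split and the four metrics can be illustrated with a short, dependency-free Python sketch. It is not the scikit-learn implementation used in the study: the fold assignment is a simplified stratified split, the choice of "resistant" as the positive class is an assumption, and all names and labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=4, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def classification_metrics(y_true, y_pred, positive="resistant"):
    """Accuracy, recall, precision and specificity from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical labels and predictions, for illustration only.
y_true = ["resistant"] * 4 + ["susceptible"] * 4
y_pred = ["resistant", "resistant", "resistant", "susceptible",
          "susceptible", "susceptible", "resistant", "resistant"]
metrics = classification_metrics(y_true, y_pred)
folds = stratified_folds(y_true * 2, k=4, seed=1)
```

Repeating the split with different seeds (for example, 100 values of `seed`) and averaging the metrics over all folds gives a repeated stratified K-fold estimate of the kind described above.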
We also tested FS techniques implemented in the scikit-learn Python v.3 module88. We tested the following approaches to obtain feature importance and create subsets of the marker data: (1) gradient tree boosting (FS1)90, (2) L1-based FS through a linear support vector classification system (FS2)81, (3) extremely randomized trees (FS3)91, (4) univariate FS using ANOVA (FS4), and (5) RF (FS5)84. The best genotyping approach for predicting brown rust phenotypic groups was selected by analyzing each prediction measure and counting the percentage of models with the best performance when using ADs or APs in the different established dataset configurations (all the SNPs and the datasets estimated through FS1, FS2, FS3, FS4, and FS5). For evaluating the prediction capability of each FS technique, we combined the calculated metrics with statistical approaches.
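To make one of these approaches concrete, the following sketch scores each marker with a one-way ANOVA F statistic and keeps the top-ranked ones, which is the idea behind univariate selection (FS4). It is a plain-Python illustration rather than the scikit-learn routine used in the study; the function names, the marker names and the allele proportions are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by `labels`."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def top_k_features(matrix, labels, k):
    """Rank markers by F statistic and return the names of the best k.

    `matrix` maps a marker name to its per-individual allele proportions.
    """
    scores = {name: anova_f(column, labels) for name, column in matrix.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical allele proportions for three markers and six individuals.
groups = ["R", "R", "R", "S", "S", "S"]
markers = {
    "snp_a": [0.90, 0.80, 0.85, 0.20, 0.30, 0.25],  # strong group difference
    "snp_b": [0.50, 0.40, 0.60, 0.50, 0.60, 0.40],  # no group difference
    "snp_c": [0.60, 0.70, 0.50, 0.40, 0.50, 0.60],  # weak group difference
}
best_two = top_k_features(markers, groups, k=2)
```

The tree-based and L1-based approaches differ in how the importance score is obtained, but they share this final step of ranking or thresholding markers to form a reduced subset.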
For each evaluation metric (accuracy, recall, precision and specificity) and each subset of identified markers (FS1, FS2, FS3, FS4 and FS5), we performed a Shapiro–Wilk normality test (p-value < 0.01) and identified confidence intervals for means (95%, 99% and 99.9% confidence) using the gmodels v.2.18.1 R package92 for parametric data and a Wilcoxon test for nonparametric data. We assessed the capability of each FS technique to exceed the confidence intervals for each metric in order to select the most promising strategies. Additionally, for comparing the predictive profiles, we tested the differences in these metrics between the selected FS methods using ANOVA and multiple comparisons by Tukey’s test implemented in the agricolae v.1.3-1 R package93. With the most promising strategies identified, we also evaluated the intersection of these datasets using the R package VennDiagram78.
Functional annotation
From the SNPs identified by the most promising FS technique, we selected the reference positions to which they belonged and extracted the respective region. To check the distribution of these SNPs in the Bru1 region, we selected nine bacterial artificial chromosomes (BACs) from the sugarcane cultivar R570 that were previously described as belonging to regions containing Bru194. These BACs were retrieved from the GenBank database95. We performed comparative alignments of the nine BACs and the selected reference sequences against S. spontaneum1 coding DNA sequences (CDSs) using BLASTn59 with the following parameters: a minimum e-value of 1e−30, a minimum percent identity of 95% and coverage of at least 75% in the query sequence. The distribution of these regions among S. spontaneum chromosomal regions was inferred using the karyoploteR package96.
We created a dataset with CDSs extracted from Phytozome66 v.13 for fourteen different species from the Poaceae family (Brachypodium distachyon, Brachypodium hybridum, Brachypodium sylvaticum, Hordeum vulgare, Oryza sativa, Oropetium thomaeum, Panicum hallii, Panicum virgatum, Sorghum bicolor, Setaria italica, Setaria viridis, Triticum aestivum, Thinopyrum intermedium and Zea mays) and Arabidopsis thaliana. Selected S. spontaneum CDSs were aligned against this dataset, enabling the identification of correspondence with Gene Ontology97 (GO) categories and Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthologies98 (KOs). All identified GO categories were used to create a treemap to visualize possible correlated categories in the dataset caused by identified regions using FS and BAC correspondence. This step was performed using the REVIGO tool99.
Results
Phenotypic analyses
The brown rust phenotypic dataset was analyzed as described in Section 2 (Supplementary Figs. S1–S6) of the Supplementary Information (SI). Using the phenotypic mixed model created, we obtained a heritability of approximately 60%. Through the established statistical analysis procedures, we identified two different phenotypic groups, which were used for association analyses. These groups presented high divergence in scores, and the individuals were classified as belonging to the “resistant” group or the “susceptible” group.
Genotyping process
Raw GBS data quality control (Supplementary Tables S1 and S2), read alignment (Supplementary Table S3 and Supplementary Fig. S7), reference evaluation (Supplementary Tables S4–S9 and Supplementary Fig. S8) and SNP calling (Supplementary Tables S8–S11) were performed as described in the SI. The selected mapping tool was BWA, as it allowed the identification of a larger quantity of uniquely mapped reads. We considered the MF genome the most appropriate reference for SNP calling with our GBS dataset; this choice was made because this reference provided the largest percentage of uniquely mapped reads (Supplementary Table S3), the most consensus sequences and respective profiles (Supplementary Tables S4 and S5), the greatest sequencing depth at different mapping positions (Supplementary Table S6), the greatest ability for its consensus contigs to represent the majority of the other reference consensus sequences (Supplementary Table S7), and the largest quantity of SNPs identified using Tassel and Stacks (Supplementary Tables S8 and S9). The SNP calling process performed with the different tools and MF reference resulted in different quantities of markers, as described in the SI. The quantity of SNPs can be observed in Table 1.
The intersections between SNPs found with different tools can be visualized in Fig. 1. A total of 13,458 SNP markers were found by all used callers. However, when applying a reasonable filter for locus depths (minimum count of 50 per individual) and missing data (maximum of 25%)23, this quantity decreased to 2284. Although this approach of selecting intersecting SNPs enables the definition of a highly stringent set, the quantity of false negatives will also be high. To establish a reasonable approach for SNP identification in sugarcane, we selected the most probable SNPs as the variants found with Tassel and at least one other caller. With this approach, we found 88,395 SNPs (eliminating 49,362 possibly false-positive SNPs uniquely identified by Tassel). After applying filters based on missing data and read depth, we obtained a final set of 14,540 SNPs (eliminating 4341 questionable SNPs among 18,881 markers that would have been obtained using Tassel as a unique tool together with the described filters). These datasets were used to evaluate possible ploidy configurations using SuperMASSA software. This evaluation was performed on (I) 14,540 SNPs representing our final set of markers and (II) 4341 SNPs representing the most likely false-positive SNPs.
Separating SuperMASSA posterior probabilities into three categories (A, B and C) based on their reliability (Fig. 1) classified a considerable number of SNPs as having a specific ploidy with high confidence. However, the first set with 14,540 SNPs included variants with ploidies more similar to those expected for sugarcane (6 to 14)7 compared with the second one with 4341 SNPs. The majority of SNPs in the second set were classified as having a ploidy of 20, representing doubtful regions with chances of duplication events and low-quality data7. Therefore, this group of putative molecular markers with higher reliability provided better results in terms of estimated ploidies.
Phenotype–genotype associations
To understand the genotypic associations with different brown rust phenotypes more generally, we chose to perform genotype-phenotype analyses with the phenotypic rust groups identified in the clustering analysis. We performed these tests using two different approaches for genomic prediction: ADs obtained with SuperMASSA (ploidy range between 6 and 14 and posterior probability greater than or equal to 0.8) and APs calculated based on Tassel output for read counts. With these two different datasets, the FS techniques were applied and generated different sets of SNPs (Table 2).
These SNPs were used to predict the phenotypic rust groups using the eight selected ML algorithms in the proposed cross-validation scenario (described in SI in Supplementary Tables S12–S17). The performance of APs was superior to that of ADs for all evaluated metrics (Supplementary Table S18). In almost 73% of the tests with different algorithms, the usage of APs was equal or superior to that of ADs. Although there were some discrepancies across models and between FS subsets, we observed better use of sugarcane GBS data with APs. In addition, the quantity of SNPs discarded to obtain favorable ADs in sugarcane was almost 64%. Therefore, we considered the analysis of APs better than ADs for the task of genomic prediction.
The capability of predicting brown rust phenotypic groups was quite different among the created scenarios. Using the entire dataset, the overall accuracy was near 50%, showing the models’ inefficiency in capturing the real SNP effects. The KNN model presented an accuracy of almost 70% but with a very small value of specificity (0.23%), indicating its inefficiency in predicting these phenotypes when using the entire set of SNPs. With FS techniques, these values increased but still presented differences between the selected methods. To evaluate the best FS techniques with which to increase predictive capabilities, we determined confidence intervals for all metric means (accuracy, recall, precision and specificity), as shown in Supplementary Table S19. Then, we counted the quantity of measures that exceeded the upper bounds. FS1, FS2 and FS4 had the best performance, as described in the SI (Supplementary Tables S20–S23). Furthermore, we analyzed the distributions and similarities of these metrics. The accuracy distribution is shown in Fig. 2, and the other distributions are shown in the SI (Supplementary Figs. S9–S12). FS3 and FS5 clearly did not allow a substantial increase in these performance measures. The maximum values in the boxplots for FS3 and FS5 are close to the medians of FS1, FS2 and FS4. In addition, considering that multiple comparisons by Tukey’s test grouped FS3, FS5 and the initial dataset together, we can conclude that these techniques did not enable substantial improvement in accuracy. Analyses of the other metrics also showed better performance of FS1, FS2 and FS4 than of the other datasets, including FS3, FS5 and the entire set of SNPs. Due to these findings, we considered FS1, FS2 and FS4 the most promising methodologies for detecting variants with high predictive capabilities.
The FS1, FS2 and FS4 methods identified different variants in different scaffolds. However, there were intersections between these sets (Fig. 2), which we decided to evaluate. We tested all selected ML algorithms using the intersection between at least two strategies (Inter 2), which corresponded to 131 SNPs, and the intersection between the three strategies (Inter 3), which corresponded to 6 SNPs. The results obtained using Inter 3 did not increase the metric values of the FS techniques; however, they were far superior to the initial results obtained with the entire dataset (approximately 41% larger), as described in the SI (Supplementary Table S24). Inter 2, however, showed the highest predictive capabilities (Table 3), suggesting that these variants have a greater probability of being associated with brown rust phenotypes.
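The construction of the Inter 2 and Inter 3 subsets amounts to counting in how many FS selections each SNP appears. A minimal Python sketch, with hypothetical SNP identifiers and a hypothetical function name, is given below.

```python
from collections import Counter


def intersections(selections):
    """Return (SNPs in at least two selections, SNPs in all selections).

    `selections` maps a feature-selection label to the collection of SNP
    identifiers that method retained.
    """
    counts = Counter(snp for chosen in selections.values()
                     for snp in set(chosen))
    in_at_least_two = {snp for snp, n in counts.items() if n >= 2}
    in_all = {snp for snp, n in counts.items() if n == len(selections)}
    return in_at_least_two, in_all


# Hypothetical selections, for illustration only.
toy_selections = {
    "FS1": {"snp_a", "snp_b", "snp_c", "snp_d"},
    "FS2": {"snp_b", "snp_c", "snp_e"},
    "FS4": {"snp_c", "snp_d", "snp_f"},
}
inter_2, inter_3 = intersections(toy_selections)
```

With three selections, the second set is always a subset of the first, which is consistent with Inter 3 (6 SNPs) being much smaller than Inter 2 (131 SNPs).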
The tested ML models had different capabilities of separating the phenotypic groups, and these capabilities changed depending on the dataset used. In addition to using the previous metrics, we chose to evaluate model performance using ROC curves and the respective AUCs. All of these plots are shown in the SI together with the AUC values (Supplementary Figs. S13–S20). We evaluated two different configurations to consider a model with reasonable predictive performance: (A) AUC \(\ge 0.8\) and (B) AUC \(\ge 0.9\). For (A), we identified AB, GNB, GP and MLP as the most promising models when using the FS1, FS2 and FS4 techniques. When using the entire dataset and FS3, there were no significant changes in performance under (A). GNB was the best model for FS5; GNB and RF were the best models for Inter 3; and KNN, GP, RF, MLP, AB and GNB were the best models for Inter 2. This first configuration enabled identification of the Inter 2 FS technique as the most appropriate for the creation of stable models using ML strategies. The performance of the built models based on Inter 2 is shown by ROC curves in Fig. 3 and contrasted with the results for the entire dataset. For (B), the entire dataset, FS3, FS5 and Inter 3 did not have AUC values exceeding 0.9, supporting the exclusion of FS3 and FS5 as interesting for detecting phenotype-associated variants. GNB was the best model for FS1, and GP, MLP and GNB were the best models for FS2, FS4 and Inter 2. Thus, we considered GP, MLP and GNB the best models for predicting the brown rust phenotypic groups. The ROC curves for these three algorithms and the different subsets are provided in the SI (Supplementary Figs. S21–S23). The best AUC values were (I) MLP: 0.99 for Inter 2 and 1.00 for FS2, (II) GNB: 0.98 for Inter 2 and 0.96 for FS2, and (III) GP: 0.98 for Inter 2 and 0.98 for FS2. This finding supports the hypothesis of an association between Inter 2 regions and brown rust phenotypes.
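The two AUC configurations can be illustrated with a short sketch. It is not the implementation used here: the AUC is computed with the pairwise (Mann–Whitney) formulation rather than with a library routine, and the function names, scores and model names are hypothetical.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    # Ties between a positive and a negative score count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def passing_models(auc_by_model, threshold):
    """Names of the models whose AUC reaches the given threshold."""
    return sorted(m for m, auc in auc_by_model.items() if auc >= threshold)


# Hypothetical predicted scores for four plants
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical AUC values for three unnamed models
auc_by_model = {"model_a": 0.95, "model_b": 0.85, "model_c": 0.60}
config_a = passing_models(auc_by_model, 0.8)  # configuration (A)
config_b = passing_models(auc_by_model, 0.9)  # configuration (B)
print(toy_auc, config_a, config_b)
```

Because (B) is stricter than (A), every model retained under (B) is also retained under (A).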
On the basis of these results, we suggest that the identification of intersections between FS1, FS2 and FS4 might be an appropriate methodology for both genomic prediction and the identification of regions associated with brown rust phenotypes.
The last analysis that we performed to test whether this methodology was a promising strategy was an evaluation of the genomic regions where the selected variants were located. For this step, we used S. spontaneum CDSs corresponding to (A) 9 selected BACs related to Bru1 QTL regions and (B) 146 MF scaffolds identified as important by at least two methods. We identified 373 CDSs using (A) and 240 CDSs using (B). All BACs of (A) had correspondences, and nine scaffolds of (B) did not have relevant alignments. As there was only one CDS in common between (A) and (B), we evaluated the chromosomal location of these CDSs considering the S. spontaneum genomic reference, which is presented in the SI (Supplementary Fig. S24). Notably, regions where these CDSs were located were spread throughout the genome. However, nearly all CDSs identified in (B) were close to CDSs identified in (A), suggesting linkage disequilibrium between these regions due to chromosomal proximity. Additionally, to understand whether these genomic regions have similar impacts on biological processes, we performed enrichment analysis using the GO categories of these two groups (Fig. 4). We found 148 different GO categories in (A) and 100 in (B), with 50 GOs in common. The other 50 categories identified for only the selected variants can be found in the SI (Supplementary Fig. S25); there were four main categories: (I) sphingolipid metabolism, (II) DNA topological change, (III) nitrogen compound transport, and (IV) phosphatidylinositol-mediated signaling.
In relation to metabolic pathways, we selected the S. bicolor KEGG correspondences for each CDS and also separated these findings into groups (A) and (B). The complete discrimination of the identified pathways is shown in the SI. We found 41 associated pathways in (A) and 29 in (B), with 16 in common between these groups. As expected, there was an elevated number of common biological cascades that might be influenced by these regions. The specific pathways found exclusively in group (B) were monoterpenoid biosynthesis, phenylpropanoid biosynthesis, the pentose phosphate pathway, sulfur metabolism, other glycan degradation, fatty acid elongation, basal transcription factors, ubiquitin-mediated proteolysis, various types of N-glycan biosynthesis, tryptophan metabolism, sphingolipid metabolism, carbon metabolism and N-glycan biosynthesis.
Discussion
The organization of the sugarcane genome greatly challenges genetic studies of this species, and alternative approaches must be employed to overcome these difficulties. Here, we developed a novel strategy to address sugarcane genomic specificities and enable the identification of genomic regions related to brown rust resistance through the evaluation of ML predictive performance. The sequencing method, SNP detection process and phenotypic associations were designed to fit these singularities. Sugarcane brown rust susceptibility was previously studied and applied in sugarcane breeding programs28; however, there is still a gap in the characterization of the wide range of genes involved in the process of infection and how different genomic polymorphisms can influence this phenotype. The adjustments performed on these analyses showed reasonable results, and the identification of these possibly phenotype-causative regions can help unravel sugarcane brown rust resistance molecular mechanisms and the selection of targets for breeding.
First, due to the diversity of rust scores (1–9), the variation in rust phenotypes within populations and the qualitative nature of rust phenotypes, we decided to use the two groups identified by the BLUP clustering analysis instead of the raw scores. We were interested in finding markers and genomic regions related to brown rust resistance, and the establishment of these two major groups enabled the identification of resistance categories in the population. As previously described, these phenotypic rust groups presented a high level of differentiation in rust scores, and this contrast in susceptibility may aid in the identification of the most promising plants for sugarcane breeding programs. In addition, the establishment of these groups allowed the use of a wide range of ML strategies.
In relation to the sugarcane genotyping process, different approaches have been adopted by the scientific community to reduce the genomic complexity of sugarcane and utilize a limited amount of information. Song et al.100, for example, designed different probes using in silico approaches. The resulting regions were subsequently adopted in other studies101,102,103,104 due to the large quantity of markers with sufficient sequencing depth located in genic regions. Another approach is GBS5,23,28,105,106,107, which is the preferred genotyping method for plants with some degree of genomic complexity23,108 mainly due to its simplicity, reproducibility and considerable genome coverage109. In addition, regulatory regions controlling different phenotypes are often located in noncoding DNA, and GBS allows the amplification of such regions50. Herein, we decided to use GBS to obtain a broader set of genomic regions with their respective probabilities of correspondence with rust resistance.
Sequencing reads are generally organized by using the S. bicolor genome for comparative alignments and the subsequent identification of putative variants with bioinformatic methods. This reference choice is due to sorghum’s phylogenetic proximity to sugarcane5,28,105,106,107 and, in some cases, probe experimental design100,101,102,103,104. Despite sorghum’s genome usage, the availability of sugarcane pseudoreferences has provided new genomic tools for scientific research as initially explored by Balsalobre et al.23, who used the sorghum genome, a sugarcane MF genome65, a sugarcane leaf transcriptome51 and SUCEST tags68. However, new sugarcane genomic resources are now available, such as the draft genome of the cultivar SP80-328067, the monoploid genome of the R570 variety3 and the genome of the AP85-441 S. spontaneum cultivar1, which are phylogenetically closer to current sugarcane cultivar resources than are sorghum resources. Therefore, there is a need to explore these new references and check their appropriateness. As there are no previous reports of the usage of these novel references together with sugarcane GBS data, we decided to test them in order to identify the most appropriate reference.
Although GBS allows a reduction in genomic complexity, we must consider sugarcane singularities to establish an analysis pipeline. In GBS experiments, the consensus of read clusters at cutting sites could be adopted as a reference in cases where there is no appropriate sequence to use50. However, genome assembly is a difficult task when dealing with repetitive regions and polyploids110. With the aim of reducing possible biases, we decided not to use de novo approaches, which were previously described as inappropriate for sugarcane GBS data105.
In our study, the combination of BWA and MF scaffolds had the best performance for GBS data. BWA was previously reported as a sensitive tool for aligning sugarcane reads and retaining a large number of uniquely mapped sequences100. In terms of MF performance, this may be explained by the experimental procedures of MF sequencing and GBS library preparation. GBS library construction is based on the selection of a subset of genomic regions using methylation-sensitive restriction enzymes, which avoid repetitive regions50. To select our GBS regions, we used PstI, a methylation-sensitive restriction enzyme, to select hypomethylated DNA111. Similarly, the MF genome was obtained through a process of sequencing where genomic regions were also selected based on hypomethylation65. This approach generated high compatibility between our data and the genomic reference, as observed in the comparative alignments and previous reports23. Although there have been great advances in understanding the sugarcane genome since the S. spontaneum genome became available, we decided to perform our analyses using the sugarcane MF genome to capture the most probable markers and establish a criterion based on data appropriateness. This genomic reference is still at the scaffold level, but as shown in this study, there is a high rate of redundancy among consensus sequences obtained through GBS data alignments with the different references. Due to this observed redundancy, we chose not to use all of the references. Beyond adding redundant markers, consensus contigs built from different references can lead to different alignments of GBS data. These alignments may in turn produce different organizational profiles of read alignments and divergent SNPs. Therefore, we selected the reference that made the best use of the GBS data as the most appropriate and analyzed the respective SNPs.
A wide range of SNP callers are available. Tassel was developed to handle GBS data and has been widely applied to species with different genomic organizations. Although this tool enables the identification of many SNPs, it was previously described as insufficiently accurate to be used alone112. Thus, to increase the reliability of our data, we decided to use other SNP callers (GATK, FreeBayes, SAMtools and Stacks) in combination with Tassel, as the usage of SNPs identified by more than one caller is more reliable than the usage of SNPs identified by only one caller113. The intersection between the SNPs identified by at least two tools was established to increase the accuracy of these variants without substantially increasing the number of false negatives. In addition, Tassel was used due to its targeted development for GBS data and preprocessing steps. The Tassel workflow keeps read depths unchanged between the initial mapping and the final data generated for the identified genotypes. In sugarcane, this information is necessary to estimate ADs or calculate APs. Using this intersection approach, we identified the final set of SNPs to be used for our association analyses. Indels, however, were not selected. These variants identified by in silico strategies do not provide reliable information, showing elevated divergence between the existent callers and a probability of producing spurious variants114.
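The caller-consensus criterion can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function name `consensus_snps`, the representation of a site as a (scaffold, position) pair and the toy calls are hypothetical. The optional `required_tool` argument is an assumption added here to show how sites could be restricted to those for which Tassel read depths are available.

```python
from collections import Counter


def consensus_snps(calls_by_tool, min_tools=2, required_tool=None):
    """Sites reported by at least `min_tools` callers.

    If `required_tool` is given, only sites also reported by that caller are kept.
    """
    counts = Counter()
    for sites in calls_by_tool.values():
        counts.update(set(sites))
    consensus = {site for site, n in counts.items() if n >= min_tools}
    if required_tool is not None:
        consensus &= set(calls_by_tool[required_tool])
    return consensus


# Hypothetical (scaffold, position) calls from each tool
calls_by_tool = {
    "Tassel": {("scaffold_1", 120), ("scaffold_2", 45)},
    "GATK": {("scaffold_1", 120), ("scaffold_3", 77)},
    "FreeBayes": {("scaffold_3", 77)},
    "SAMtools": {("scaffold_4", 9)},
    "Stacks": set(),
}
consensus = consensus_snps(calls_by_tool)
with_tassel = consensus_snps(calls_by_tool, required_tool="Tassel")
print(sorted(consensus), sorted(with_tassel))
```

Sites reported by a single caller are discarded, whereas requiring agreement among all five callers would be expected to remove many true variants.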
Using this approach, we found 14,540 putative SNPs. With these regions, we tested two different strategies for genotyping the population at these loci: (1) the usage of ADs estimated with SuperMASSA and (2) the usage of APs calculated based on Tassel output. For SuperMASSA estimations, we kept only SNPs with an estimated ploidy between 6 and 14 (minimum posterior probability of 0.8) due to sugarcane genomic configurations7,23. However, sugarcane aneuploidy together with the common occurrence of duplication events might have influenced the process of estimating locus ploidies and, in turn, the process of categorizing the related dosages through the established filters. In addition, 64% of the identified SNPs were discarded when using this approach for obtaining dosages. Because we did not need to calculate chromosomal distances between loci for linkage map construction, and given the elevated loss of markers and the reduced performance of ADs in the task of genomic prediction, we decided to continue our analyses with APs. Previous tests of this approach yielded reasonable results101,102,104.
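The two genotyping strategies can be contrasted with a short sketch. It rests on assumptions that are not stated in this section: the AP is taken here as the proportion of reads supporting the reference allele, and the function names and read depths are hypothetical. The ploidy and posterior probability thresholds are those given above.

```python
def allele_proportion(ref_depth, alt_depth):
    """Proportion of reads supporting the reference allele (assumed AP definition)."""
    total = ref_depth + alt_depth
    # Loci without reads in a sample are treated as missing
    return ref_depth / total if total else None


def keep_locus(ploidy, posterior, min_ploidy=6, max_ploidy=14, min_posterior=0.8):
    """Filter applied to dosage estimates: ploidy range and minimum posterior probability."""
    return min_ploidy <= ploidy <= max_ploidy and posterior >= min_posterior


# Hypothetical read depths for one SNP in two samples
ap_sample_1 = allele_proportion(30, 10)
ap_sample_2 = allele_proportion(0, 0)

# Hypothetical (estimated ploidy, posterior probability) pairs for three loci
loci = [(10, 0.90), (4, 0.95), (8, 0.50)]
kept = [keep_locus(ploidy, posterior) for ploidy, posterior in loci]
print(ap_sample_1, ap_sample_2, kept)
```

Unlike the dosage filter, the AP calculation does not discard a locus on the basis of its estimated ploidy, which is consistent with the larger number of markers retained under strategy (2).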
After establishing the bioinformatics pipeline for identifying and evaluating these regions, we studied the influence of SNP subsets identified by FS techniques on the task of predicting phenotypic rust groups. The amount of data generated by high-throughput sequencing technologies115 represents a challenge in genomic prediction, particularly due to the difficulty of working with high-dimensional datasets, i.e., the ’large p, small n’ problem116. This increase in the amount of available information makes the task of directly applying these marker data in genomic analyses more difficult and necessitates appropriate preprocessing steps117. In this study, we proposed the use of FS techniques to select a smaller set of SNPs with more predictive power than the entire dataset and closer associations with the brown rust phenotype to assist the identification of regions associated with disease status. This can be considered quite advantageous in the context of genomic selection because the identification of a subset of markers allows a reduction in sequencing costs49. In addition, it has already been demonstrated that for genomic selection, a selected reduced number of SNPs has reasonable reliability49,118,119.
The identification of markers related to this phenotype using FS relies on the ability of these techniques to provide an interpretable model through the close relation between trait and genotype; i.e., using a subset of the high-density markers might help elucidate the regions most likely to be involved in phenotypic differentiation120. This strategy of selecting a subgroup of SNPs with higher predictive power and closeness to the predictive class has already been employed in different contexts48,121,122. In this study, we tested five different strategies and found three promising alternatives for executing this methodology. FS1, FS2 and FS4 substantially increased the models’ capabilities of predicting the phenotypic groups, as demonstrated in this paper. We believe that this increase in predictive power is due to the identification of regions influencing the phenotype, possibly in QTLs or regulatory genomic elements. As a final strategy for the prediction and selection of these associated regions, we suggest the use of the intersection of these three techniques. This approach enabled the creation of more stable models using different ML algorithms and better accuracies for predicting these phenotypes.
Corroborating this hypothesis, we also found that most of the identified regions containing these SNPs were associated with QTLs with known biological functions, and there were also additional categories known to be correlated with rust resistance. Through comparative alignments between MF scaffolds and S. spontaneum CDSs, we identified these regions and compared them with CDSs correlated with BACs developed based on Bru1 regions. A total of 146 different scaffolds were selected as important for this predictive task by at least two methods (FS1, FS2 and FS4). Among these sequences, only 9 did not have correspondence with S. spontaneum CDSs, possibly due to the presence of additional noncoding regulatory elements. These regions can be targets of genetic studies due to their relationships with predicted phenotypes. Although there was no considerable intersection between CDSs associated with BACs and the selected scaffolds, we did find consensus in correlated biological functions. This divergence between regions is mainly explained by the differences between the populations used to generate the GBS data and the brown rust QTLs (which were used to select BACs). QTL regions are identified for a specific population, and there might be differences between datasets from different populations, especially for the sugarcane genome. In addition, the creation of sugarcane linkage maps relies on many adaptations of methods, such as the selection of only single-dosage markers23, which might lead to the identification of a restricted set of QTLs and the nonuse of many auxiliary genomic elements.
The exclusive GO categories related to the selected variants have already been reported to be associated with resistance. Sphingolipid metabolism is intimately connected to programmed cell death123,124,125; DNA topological change is a wider category with different implications in many biological processes, including responses to pathogens126; differences in nitrogen compound transport might be related to the accumulation of this nutrient and its influence on resistance against pathogens127; and phosphatidylinositol-mediated signaling includes important categories that also act on plants’ responses to pathogens125. A considerable number of metabolic pathways related to both BACs and the selected scaffolds were also detected. However, specific pathways were found to be associated with these scaffolds, mainly due to the different roles of the proteins encoded by these identified CDSs and because these pathways were already reported as being associated with plant responses to different pathogens123,128,129,130,131,132,133,134,135, further corroborating our findings. The indication of possible mutation events in these regions provides evidence of differences in protein expression and phenotypic characteristics.
The identified regions with putative variants and high predictive performance for brown rust phenotypic groups can be employed as novel regions to investigate susceptibility-related traits. This proposed strategy can complement traditional methodologies for deciphering sugarcane genomic regions associated with pathogen infection responses and susceptibility. Although these SNPs were identified for only one biparental population, the strategy can be used for different populations, and the genes can be further investigated to validate the influence of the genomic regions on different phenotypes. This study represents an initial step in employing ML and FS strategies in sugarcane genomic studies. We illustrated the great potential of applying these methodologies to predict phenotypes by using a highly complex polyploid species.
Code availability
Accession codes: Sequencing data are available through the Sequence Read Archive (SRA) database with the accession code SRP151376.
References
Zhang, J. et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat. Genet. 50, 1565 (2018).
Chiconato, D. A., Junior, G. D., dos Santos, D. M. & Munns, R. Adaptation of sugarcane plants to saline soil. Environ. Exp. Botany 162, 201–211 (2019).
Garsmeur, O. et al. A mosaic monoploid reference sequence for the highly complex genome of sugarcane. Nat. Commun. 9, 2638 (2018).
D’Hont, A., Ison, D., Alix, K., Roux, C. & Glaszmann, J. C. Determination of basic chromosome numbers in the genus Saccharum by physical mapping of ribosomal RNA genes. Genome 41, 221–225 (1998).
Yang, X. et al. Constructing high-density genetic maps for polyploid sugarcane (Saccharum spp.) and identifying quantitative trait loci controlling brown rust resistance. Mol. Breed. 37, 116 (2017).
Hoang, N. V. et al. A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics 18, 395 (2017).
Garcia, A. A. et al. SNP genotyping allows an in-depth characterisation of the genome of sugarcane and other complex autopolyploids. Sci. Rep. 3, 3399 (2013).
Sforça, D. A. et al. Gene duplication in the sugarcane genome: A case study of allele interactions and evolutionary patterns in two genic regions. Front. Plant Sci. 10, 553 (2019).
D’Hont, A. Unraveling the genome structure of polyploids using FISH and GISH; examples of sugarcane and banana. Cytogenet. Genome Res. 109, 27–33 (2005).
Mancini, M. C., Cardoso-Silva, C. B., Sforça, D. A. & Pereira de Souza, A. “Targeted sequencing by gene synteny,” a new strategy for polyploid species: Sequencing and physical structure of a complex sugarcane region. Front. Plant Sci. 9, 397 (2018).
Balsalobre, T. W. et al. Mixed modeling of yield components and brown rust resistance in sugarcane families. Agron. J. 108, 1824–1837 (2016).
Racedo, J. et al. Molecular diagnostic of both brown and orange sugarcane rust and evaluation of sugarcane brown rust resistance in Tucuman, Argentina, using molecular markers associated with Bru1 a broad-range resistance allele. Sugar Tech. 18, 414–419 (2016).
Li, Z. et al. Molecular insights into brown rust resistance and potential epidemic based on the Bru1 gene in sugarcane varieties and new elite clones. Euphytica 214, 189 (2018).
Wang, X.-Y. et al. Developing genetically segregating populations for localization of novel sugarcane brown rust resistance genes. Euphytica 215, 159 (2019).
Rott, P. A Guide to Sugarcane Diseases (Editions Quae, 2000).
Hoy, J. & Hollier, C. Effect of brown rust on yield of sugarcane in Louisiana. Plant Dis. 93, 1171–1174 (2009).
Asnaghi, C. et al. Targeted mapping of a sugarcane rust resistance gene (Bru1) using bulked segregant analysis and AFLP markers. Theor. Appl. Genet. 108, 759–764 (2004).
Costet, L. et al. Haplotype structure around Bru1 reveals a narrow genetic basis for brown rust resistance in modern sugarcane cultivars. Theor. Appl. Genet. 125, 825–836 (2012).
Daugrois, J.-H. et al. A putative major gene for rust resistance linked with a RFLP marker in sugarcane cultivar ‘R570’. Theor. Appl. Genet. 92, 1059–1064 (1996).
Raboin, L.-M. et al. Genetic mapping in sugarcane, a high polyploid, using bi-parental progeny: Identification of a gene controlling stalk colour and a new rust resistance gene. Theor. Appl. Genet. 112, 1382–1391 (2006).
Li, W.-F. et al. Identification of field resistance and molecular detection of the brown rust resistance gene bru1 in new elite sugarcane varieties in China. Crop Prot. 103, 46–50 (2018).
Mollinari, M. & Garcia, A. A. F. Linkage analysis and haplotype phasing in experimental autopolyploid populations with high ploidy level using hidden Markov models. G3 Genes Genomes Genet. 9, 3297–3314 (2019).
Balsalobre, T. W. A. et al. GBS-based single dosage markers for linkage and QTL mapping allow gene mining for yield-related traits in sugarcane. BMC Genomics 18, 72 (2017).
Costa, E. A. et al. QTL mapping including codominant SNP markers with ploidy level information in a sugarcane progeny. Euphytica 211, 1–16 (2016).
Bourke, P. M. et al. polymapR–linkage analysis and genetic map construction from F1 populations of outcrossing polyploids. Bioinformatics 34, 3496–3502 (2018).
Grandke, F., Ranganathan, S., van Bers, N., de Haan, J. R. & Metzler, D. PERGOLA: Fast and deterministic linkage mapping of polyploids. BMC Bioinform. 18, 12 (2017).
Behrouzi, P. & Wit, E. C. De novo construction of polyploid linkage maps using discrete graphical models. arXiv preprint arXiv:1710.01063 (2017).
Yang, X. et al. Identifying quantitative trait loci (QTLs) and developing diagnostic markers linked to orange rust resistance in sugarcane (Saccharum spp.). Front. Plant Sci. 9, 350 (2018).
Muranty, H. et al. Potential for marker-assisted selection for forest tree breeding: Lessons from 20 years of MAS in crops. Tree Genet. Genomes 10, 1491–1510 (2014).
Cros, D. et al. Within-family genomic selection in rubber tree (Hevea brasiliensis) increases genetic gain for rubber production. Ind. Crops Prod. 138, 111464 (2019).
Crossa, J. et al. Genomic selection in plant breeding: Methods, models, and perspectives. Trends Plant Sci. 22, 961–975 (2017).
Hadasch, S., Simko, I., Hayes, R. J., Ogutu, J. O. & Piepho, H.-P. Comparing the predictive abilities of phenotypic and marker-assisted selection methods in a biparental lettuce population. Plant Genome 9, 1 (2016).
Heffner, E. L., Sorrells, M. E. & Jannink, J.-L. Genomic selection for crop improvement. Crop Sci. 49, 1–12 (2009).
Park, S., Jackson, P., Berding, N. & Inman-Bamber, G. Conventional breeding practices within the Australian sugarcane breeding program. Proc. Austral. Soc. Sugar Cane Technol. 29, 113–121 (2007).
Hayes, B. et al. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
Li, X. et al. Genomic prediction of biomass yield in two selection cycles of a tetraploid alfalfa breeding population. Plant Genome 8, 1 (2015).
Norman, A. et al. Increased genomic prediction accuracy in wheat breeding using a large Australian panel. Theor. Appl. Genet. 130, 2543–2555 (2017).
Gouy, M. et al. Experimental assessment of the accuracy of genomic selection in sugarcane. Theor. Appl. Genet. 126, 2575–2586 (2013).
González-Camacho, J. M. et al. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome 11, 1–15 (2018).
Zhang, J. et al. Computer vision and machine learning for robust phenotyping in genome-wide studies. Sci. Rep. 7, 44048 (2017).
Grinberg, N. F. et al. Implementation of genomic prediction in Lolium perenne (L.) breeding populations. Front. Plant Sci. 7, 133 (2016).
Edwards, S. M., Sørensen, I. F., Sarup, P., Mackay, T. F. & Sørensen, P. Genomic prediction for quantitative traits is improved by mapping variants to gene ontology categories in Drosophila melanogaster. Genetics 203, 1871–1883 (2016).
Hickey, J. M. et al. Evaluation of genomic selection training population designs and genotyping strategies in plant breeding programs using simulation. Crop Sci. 54, 1476–1488 (2014).
Verleysen, M. & François, D. The curse of dimensionality in data mining and time series prediction. In International Work-Conference on Artificial Neural Networks, 758–770 (Springer, Berlin, 2005).
Dash, M. & Liu, H. Feature selection for classification. Intell. Data Anal. 1, 131–156 (1997).
Li, J. et al. Feature selection: A data perspective. ACM Comput. Surv. CSUR 50, 94 (2018).
Li, B. et al. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front. Genet. 9, 237 (2018).
Bermingham, M. L. et al. Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Sci. Rep. 5, 10312 (2015).
Long, N., Gianola, D., Rosa, G. & Weigel, K. Dimension reduction and variable selection for genomic selection: Application to predicting milk yield in holsteins. J. Anim. Breed. Genet. 128, 247–257 (2011).
Elshire, R. J. et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6, e19379 (2011).
Cardoso-Silva, C. B. et al. De novo assembly and transcriptome analysis of contrasting sugarcane varieties. PLoS ONE 9, e88462 (2014).
Santos, F. R. et al. Marker-trait association and epistasis for brown rust resistance in sugarcane. Euphytica 203, 533–547 (2015).
Amorim, L. et al. Metodologia de avaliação da ferrugem da cana-de-açúcar (puccinia melanocephala). Boletim Técnico Copersucar 39, 13–16 (1987).
R Core Team. R: A Language and Environment for Statistical Computing (2013).
Peterson, R. bestNormalize: Normalizing Transformation Functions, R package version 1.2.0 (2018).
Muñoz, F. & Sanchez, L. breedR: Statistical Methods for Forest Genetic Resources Analysts R package version 0.12-4. (2019).
Kassambara, A. & Mundt, F. Package ‘factoextra’. Extract and Visualize the Results of Multivariate Data Analyses, Vol. 76, (2017).
Aljanabi, S. M., Forget, L. & Dookun, A. An improved and rapid protocol for the isolation of polysaccharide-and polyphenol-free sugarcane DNA. Plant Mol. Biol. Report. 17, 281–282 (1999).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides, N. C. & Pati, A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand. Genomic Sci. 10, 18 (2015).
Andrews, S. et al. FastQC: A quality control tool for high throughput sequence data (2010).
Gordon, A. et al. Fastx-toolkit. A Short-Reads Preprocessing Tools (Unpublished), Vol. 5, (2010).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).
Grativol, C. et al. Sugarcane genome sequencing by methylation filtration provides tools for genomic research in the genus Saccharum. Plant J. 79, 162–172 (2014).
Goodstein, D. M. et al. Phytozome: A comparative platform for green plant genomics. Nucleic Acids Res. 40, D1178–D1186 (2011).
Riaño-Pachón, D. M. & Mattiello, L. Draft genome sequencing of the sugarcane hybrid SP80-3280. F1000Research 6 (2017).
Nishiyama-Jr, M. et al. The SUCEST-FUN regulatory network database: Designing an energy grass. Proc. Int. Soc. Sugar Cane Technol. 27, 1–10 (2010).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A. & Cresko, W. A. Stacks: An analysis tool set for population genomics. Mol. Ecol. 22, 3124–3140 (2013).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Gu, Z., Gu, L., Eils, R., Schlesner, M. & Brors, B. Circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014).
Pereira, G. S., Garcia, A. A. F. & Margarido, G. R. A fully automated pipeline for quantitative genotype calling from next generation sequencing data in autopolyploids. BMC Bioinform. 19, 398 (2018).
Glaubitz, J. C. et al. TASSEL-GBS: A high capacity genotyping by sequencing analysis pipeline. PLoS ONE 9, e90346 (2014).
McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 (2012).
Broad Institute. Picard toolkit. Broad Institute, GitHub repository. http://broadinstitute.github.io/picard/ (2018).
Chen, H. & Boutros, P. C. VennDiagram: A package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinform. 12, 35 (2011).
Serang, O., Mollinari, M. & Garcia, A. A. F. Efficient exact maximum a posteriori computation for Bayesian SNP genotyping in polyploids. PLoS ONE 7, e30906 (2012).
Cover, T. & Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967).
Cristianini, N. et al. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge University Press, Cambridge, 2000).
Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning, 63–71 (Springer, Berlin, 2003).
Quinlan, J. R. Induction of decision trees. Mach. Learn. 1, 81–106 (1986).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Popescu, M.-C., Balas, V. E., Perescu-Popescu, L. & Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 8, 579–588 (2009).
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
Friedman, N., Geiger, D. & Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 29, 131–163 (1997).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 9, 90 (2007).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (ACM, New York, 2016).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Warnes, G. R., Bolker, B. & Lumley, T. Package ‘gmodels’ (2018).
de Mendiburu, F. Package ‘agricolae’. R Package, Version 1–2 (2019).
Garsmeur, O. et al. High homologous gene conservation despite extreme autopolyploid redundancy in sugarcane. New Phytol. 189, 629–642 (2011).
Benson, D. A. et al. GenBank. Nucleic Acids Res. 28, 15–18 (2000).
Gel, B. & Serra, E. karyoploteR: An R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics 33, 3088–3090 (2017).
Ashburner, M. et al. Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25 (2000).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Supek, F., Bošnjak, M., Škunca, N. & Šmuc, T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE 6, e21800 (2011).
Song, J. et al. Natural allelic variations in highly polyploidy Saccharum complex. Front. Plant Sci. 7, 804 (2016).
Yang, X., Luo, Z., Todd, J., Sood, S. & Wang, J. Genome-wide association study of multiple yield components in a diversity panel of polyploid sugarcane (Saccharum spp.). bioRxiv 387001 (2018).
Yang, X., Sood, S., Luo, Z., Todd, J. & Wang, J. Genome-wide association studies identified resistance loci to orange rust and yellow leaf virus diseases in sugarcane (Saccharum spp.). Phytopathology 109, 623–631 (2019).
Yang, X. et al. Target enrichment sequencing of 307 germplasm accessions identified ancestry of ancient and modern hybrids and signatures of adaptation and selection in sugarcane (Saccharum spp.), a ‘sweet’ crop with ‘bitter’ genomes. Plant Biotechnol. J. 17, 488–498 (2019).
Yang, X. et al. Identifying loci controlling fiber composition in polyploid sugarcane (Saccharum spp.) through genome-wide association study. Ind. Crops Prod. 130, 598–605 (2019).
Yang, X. et al. Mining sequence variations in representative polyploid sugarcane germplasm accessions. BMC Genomics 18, 594 (2017).
Fickett, N. et al. Genome-wide association mapping identifies markers associated with cane yield components and sucrose traits in the Louisiana sugarcane core collection. Genomics (2018).
Islam, M. S., Yang, X., Sood, S., Comstock, J. C. & Wang, J. Molecular characterization of genetic basis of Sugarcane Yellow Leaf Virus (SCYLV) resistance in Saccharum spp. hybrid. Plant Breed. 137, 598–604 (2018).
Li, H. et al. A high density GBS map of bread wheat and its application for dissecting complex disease resistance traits. BMC Genomics 16, 216 (2015).
Poland, J. A. & Rife, T. W. Genotyping-by-sequencing for plant breeding and genetics. Plant Genome 5, 92–102 (2012).
Benevenuto, J., Ferrão, L. F. V., Amadeu, R. R. & Munoz, P. How can a high-quality genome assembly help plant breeders? GigaScience 8, giz068 (2019).
Fellers, J. P. Genome filtering using methylation-sensitive restriction enzymes with six base pair recognition sites. Plant Genome 1, 146–152 (2008).
Torkamaneh, D., Laroche, J. & Belzile, F. Genome-wide SNP calling from genotyping by sequencing (GBS) data: A comparison of seven pipelines and two sequencing technologies. PLoS ONE 11, e0161333 (2016).
Hwang, S., Kim, E., Lee, I. & Marcotte, E. M. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci. Rep. 5, 17875 (2015).
Tian, S., Yan, H., Kalmbach, M. & Slager, S. L. Impact of post-alignment processing in variant discovery from whole exome data. BMC Bioinform. 17, 403 (2016).
Reuter, J. A., Spacek, D. V. & Snyder, M. P. High-throughput sequencing technologies. Mol. Cell 58, 586–597 (2015).
Bernardo, J. et al. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Stat. 7, 733–742 (2003).
Tadist, K., Najah, S., Nikolov, N. S., Mrabti, F. & Zahi, A. Feature selection methods and genomic big data: A systematic review. J. Big Data 6, 79 (2019).
Weigel, K. et al. Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J. Dairy Sci. 92, 5248–5257 (2009).
Usai, M. G., Goddard, M. E. & Hayes, B. J. Lasso with cross-validation for genomic selection. Genet. Res. 91, 427–436 (2009).
Haws, D. C. et al. Variable-selection emerges on top in empirical comparison of whole-genome complex-trait prediction methods. PLoS ONE 10, e0138903 (2015).
Long, N., Gianola, D., Rosa, G. J., Weigel, K. A. & Avendaño, S. Machine learning classification procedure for selecting SNPs in genomic selection: Application to early mortality in broilers. J. Anim. Breed. Genet. 124, 377–389 (2007).
Phuong, T. M., Lin, Z. & Altman, R. B. Choosing SNPs using feature selection. J. Bioinform. Comput. Biol. 4, 241–257 (2006).
Chandra, S. et al. De novo assembled wheat transcriptomes delineate differentially expressed host genes in response to leaf rust infection. PLoS ONE 11, e0148453 (2016).
Rojas, C. M., Senthil-Kumar, M., Tzin, V. & Mysore, K. Regulation of primary plant metabolism during plant–pathogen interactions and its contribution to plant defense. Front. Plant Sci. 5, 17 (2014).
Berkey, R., Bendigeri, D. & Xiao, S. Sphingolipids and plant defense/disease: The “death” connection and beyond. Front. Plant Sci. 3, 68 (2012).
Ahmed, M. B. et al. A rust fungal effector binds plant DNA and modulates transcription. Sci. Rep. 8, 14718 (2018).
Mur, L. A., Simpson, C., Kumari, A., Gupta, A. K. & Gupta, K. J. Moving nitrogen to the centre of plant defence against pathogens. Ann. Botany 119, 703–709 (2017).
Hammerbacher, A., Coutinho, T. A. & Gershenzon, J. Roles of plant volatiles in defense against microbial pathogens and microbial exploitation of volatiles. Plant Cell Environ. 42, 2827–2843 (2019).
Jeandet, P., Clément, C. & Cordelier, S. Regulation of resveratrol biosynthesis in grapevine: New approaches for disease resistance? J. Exp. Botany 70, 375–378 (2019).
Stefanowicz, K. et al. Glycan-binding F-box protein from Arabidopsis thaliana protects plants from Pseudomonas syringae infection. BMC Plant Biol. 16, 213 (2016).
Chojak-Koźniewska, J., Kuźniak, E., Linkiewicz, A. & Sowa, S. Primary carbon metabolism-related changes in cucumber exposed to single and sequential treatments with salt stress and bacterial infection. Plant Physiol. Biochem. 123, 160–169 (2018).
Fu, X., Li, C., Zhou, X., Liu, S. & Wu, F. Physiological response and sulfur metabolism of the V. dahliae-infected tomato plants in tomato/potato onion companion cropping. Sci. Rep. 6, 36445 (2016).
De Bigault Du Granrut, A. & Cacas, J.-L. How very-long-chain fatty acids could signal stressful conditions in plants? Front. Plant Sci. 7, 1490 (2016).
Adams, E. H. & Spoel, S. H. The ubiquitin-proteasome system as a transcriptional regulator of plant immunity. J. Exp. Botany 69, 4529–4537 (2018).
Maag, D., Erb, M., Köllner, T. G. & Gershenzon, J. Defensive weapons and defense signals in plants: Some metabolites serve both roles. BioEssays 37, 167–174 (2015).
Acknowledgements
This work was supported by grants from the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP 2008/52197-4 and 2005/55258-6) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES, Computational Biology Programme). A.A. received a PhD fellowship from FAPESP (2019/03232-6); E.C. received a PhD fellowship from FAPESP (2010/50031-1); R.P. received an MSc fellowship from FAPESP (2018/18588-8); and M.M. received a PD fellowship from FAPESP (2014/11482-9) and CAPES (88882.160095/2013-01).
Author information
Contributions
A.A. performed all analyses and wrote the manuscript; E.C. created the GBS library and performed the sequencing experiments; H.R. assisted in the execution of GBS quality control procedures and preprocessing steps; J.N. collaborated in the creation of ML models; R.P. and M.M. contributed to manuscript writing; F.S., L.P. and M.L. were responsible for the phenotypic experiments; and A.S. and R.K. conceived the project. All authors reviewed, read and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Aono, A.H., Costa, E.A., Rody, H.V.S. et al. Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance. Sci Rep 10, 20057 (2020). https://doi.org/10.1038/s41598-020-77063-5
DOI: https://doi.org/10.1038/s41598-020-77063-5