Genome-wide approaches for the identification of markers and genes associated with sugarcane yellow leaf virus resistance

Pimenta, Ricardo José Gonzaga; Aono, Alexandre Hild; Burbano, Roberto Carlos Villavicencio; Coutinho, Alisson Esdras; da Silva, Carla Cristina; dos Anjos, Ivan Antônio; Perecin, Dilermando; Landell, Marcos Guimarães de Andrade; Gonçalves, Marcos Cesar; Pinto, Luciana Rossini; de Souza, Anete Pereira

doi:10.1038/s41598-021-95116-1

Article
Open access
Published: 03 August 2021

Ricardo José Gonzaga Pimenta¹,
Alexandre Hild Aono¹,
Roberto Carlos Villavicencio Burbano²,
Alisson Esdras Coutinho³,
Carla Cristina da Silva¹,
Ivan Antônio dos Anjos⁴,
Dilermando Perecin³,
Marcos Guimarães de Andrade Landell⁴,
Marcos Cesar Gonçalves⁵,
Luciana Rossini Pinto⁴ &
Anete Pereira de Souza¹,⁶ (ORCID: orcid.org/0000-0003-3831-9829)

Scientific Reports volume 11, Article number: 15730 (2021)

Abstract

Sugarcane yellow leaf (SCYL), caused by the sugarcane yellow leaf virus (SCYLV), is a major disease affecting sugarcane, a leading sugar and energy crop. Despite the damage caused by SCYLV, the genetic basis of resistance to this virus remains largely unknown. Several methodologies have arisen to identify molecular markers associated with SCYLV resistance, which are crucial for marker-assisted selection and understanding response mechanisms to this virus. We investigated the genetic basis of SCYLV resistance using dominant and codominant markers and genotypes of interest for sugarcane breeding. A sugarcane panel inoculated with SCYLV was analyzed for SCYL symptoms, and viral titer was estimated by RT-qPCR. This panel was genotyped with 662 dominant markers and 70,888 SNPs and indels with allele proportion information. We used polyploid-adapted genome-wide association analyses and machine-learning algorithms coupled with feature selection methods to establish marker-trait associations. While each approach identified unique marker sets associated with phenotypes, convergences were observed between them and demonstrated their complementarity. Lastly, we annotated these markers, identifying genes encoding emblematic participants in virus resistance mechanisms and previously unreported candidates involved in viral responses. Our approach could accelerate sugarcane breeding targeting SCYLV resistance and facilitate studies on biological processes leading to this trait.

Introduction

Sugarcane is one of the world’s most important crops, ranking first in production quantity and sixth in net production value in 2016¹. It is by far the most relevant sugar crop, accounting for approximately 80% of the world’s sugar production^1,2 and is also a prominent energy crop. However, it has an extremely complex genome; modern cultivars are the product of a few crosses between two autopolyploid species. Saccharum spontaneum (2n = 5x = 40 to 16x = 128; x = 8)³, a wild stress-resistant but low-sugar species, was hybridized and backcrossed with Saccharum officinarum (2n = 8x = 80, x = 10)⁴, which has a high sugar content but is sensitive to drought and susceptible to diseases. These procedures gave origin to plants with very large (ca. 10 Gb), highly polyploid, aneuploid and remarkably duplicated genomes^5,6. This complexity directly affects sugarcane research and breeding and, until recently, it also prevented the use of codominance information in marker-assisted breeding strategies for this crop, limiting such approaches^7,8.

One of the diseases that affect this crop is sugarcane yellow leaf (SCYL), which is caused by sugarcane yellow leaf virus (SCYLV), a positive-sense ssRNA virus belonging to the Polerovirus genus^9,10. The expression of SCYL symptoms is complex and usually occurs in late stages of plant development, being mainly characterized by the intense yellowing of midribs in the abaxial surface of leaves^11,12. SCYLV alters the metabolism and transport of sucrose and photosynthetic efficiency^13,14, impairing plant development, which eventually reflects in productivity losses^{15,16,17,18,19,20}. Many SCYL symptoms may, however, be caused by other stresses or plant senescence^12,15,21, making SCYL identification troublesome. Therefore, molecular diagnosis of SCYLV infection is of great importance; this was initially performed through immunological assays¹¹, but more sensitive and accurate methods using reverse transcription followed by quantitative polymerase chain reaction (RT-qPCR) were later developed^18,22,23.

Due to SCYL's elusive symptomatology, SCYLV’s spread is silent; it is disseminated mostly during sugarcane vegetative propagation but is also transmitted by aphids, mainly the sugarcane aphid Melanaphis sacchari (Zehntner, 1897)¹¹. Unlike other pathogens, the virus is not efficiently eradicated by thermal treatments²⁴; the only way to thoroughly eliminate it is by meristem micropropagation^25,26, which is time-consuming and requires specialized infrastructure and personnel. These features make varietal resistance to SCYLV the most efficient resource to prevent damage and losses caused by this virus. Resistance has been explored in breeding programs and by a few genetic mapping studies^{27,28,29,30,31,32}. However, research on SCYL genetics is not exempt from the difficulties generated by the complexity of the sugarcane genome³³. Due to this crop’s polyploid nature, most of these works employed dominantly scored molecular markers, implying a great loss of genetic information³⁴. Additionally, they employed immunological methods to phenotype SCYLV resistance. The usage of dominant markers and the poor reliability of phenotyping were listed as key factors limiting the power of these studies^28,29.

Here, we evaluated the efficacy of several genome-wide approaches to identify markers and genes associated with SCYLV resistance. We analyzed a panel of Saccharum accessions inoculated with SCYLV, which were graded for the severity of SCYL symptoms, and their viral titer was estimated by relative and absolute RT-qPCR. This panel was genotyped with amplified fragment length polymorphisms (AFLPs) and simple sequence repeats (SSRs), as well as single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) obtained by genotyping-by-sequencing (GBS). We then employed three distinct methodologies to detect marker-trait associations: the fixed and random model circulating probability unification (FarmCPU) method using dominant AFLPs and SSRs; mixed linear modeling using SNPs and indels, in which allele proportions (APs) in each locus were employed to establish genotypic classes and estimate additive and dominant effects; and several machine learning (ML) methods coupled with feature selection (FS) techniques, using all markers to predict genotype attribution to phenotypic clusters. Finally, we annotated genes containing markers associated with phenotypes, discussing the putative participation of these genes in the mechanisms underlying resistance to SCYLV.

Results

Phenotypic data analyses

A total of 97 sugarcane accessions inoculated with SCYLV were evaluated for the severity of SCYL symptoms and for viral titer estimated by relative and absolute RT-qPCR quantification in two consecutive years, as comprehensively described in Supplementary Results. Based on best linear unbiased prediction (BLUP) estimations, symptom severity was not correlated with the viral titer determined by relative (p = 0.117) or absolute (p = 0.296) quantification. We found, however, a significant (p < 2.2e−16) and strong (r² = 0.772) correlation between the values obtained by the two quantification methods, indicating their reliability (Supplementary Fig. 2).

Using BLUP values, we performed two hierarchical clustering on principal components (HCPC) analyses to investigate the classification of genotypes according to SCYLV resistance phenotypes: the first using BLUP values of SCYLV titers determined by RT-qPCR, and the second including BLUP values of all three traits analyzed. Both analyses indicated a division of the panel into three clusters (Supplementary Figs. 3–4), named Q1-3 for the first HCPC and SQ1-3 for the second analysis. Factor maps wherein these groups are plotted onto the first two dimensions of HCPCs are shown in Fig. 1, and the attribution of genotypes to each cluster is available in Supplementary Table 4. Each group defined in the first HCPC presented significantly different SCYLV titers as estimated by both quantification methods (Supplementary Fig. 5, Supplementary Table 5). The second HCPC also resulted in a separation of groups with contrasting phenotypes: SQ1 accessions showed the least severe SCYL symptoms and the lowest titers of SCYLV; SQ2 accessions displayed significantly more severe disease symptoms and higher viral titers; and SQ3 accessions had the most severe disease symptoms and similarly high viral titers (Supplementary Fig. 6, Supplementary Table 5).

Genotyping and genetic analyses

After genotyping and filtering procedures, 93 accessions of the panel were successfully characterized with 550 AFLP fragments and 112 SSR fragments, totaling 662 polymorphic dominant markers. The GBS library constructed allowed the successful genotyping of 92 panel accessions, as described in detail in the Supplementary Results. We performed variant calling using BWA aligner and a monoploid chromosome set isolated from the S. spontaneum genome as a reference. This genome allowed the discovery of a large number of markers (38,710 SNPs and 32,178 indels) with AP information after rigorous filtering (Supplementary Tables 6–7). Additionally, unlike many of the references tested, it provided markers with information of position at chromosome level, allowing the estimation of long-distance linkage disequilibrium (LD). Pairwise LD between markers located within chromosomes was obtained and its decay was analyzed over distance. We observed high r² values (~ 0.4) between closely distanced markers, which dropped to 0.1 at approximately 2 Mb (Fig. 2).
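The LD decay analysis described above can be illustrated with a minimal sketch. It assumes that r² is taken as the squared Pearson correlation between the allele proportions of two markers across accessions, and that marker pairs are averaged within physical-distance bins; the function names, the bin size and the toy data are hypothetical and are not taken from the original analysis.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two markers' genotype vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker: r2 is undefined, treated as 0 here
    return sxy ** 2 / (sxx * syy)

def ld_decay(positions, genotypes, bin_size):
    """Average r2 of all marker pairs, grouped into physical-distance bins.

    positions: bp coordinates of markers on one chromosome
    genotypes: one list of allele proportions per marker (same accession order)
    Returns {bin start in bp: mean r2 of the pairs falling in that bin}.
    """
    bins = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = abs(positions[j] - positions[i])
            bins.setdefault(dist // bin_size, []).append(
                r_squared(genotypes[i], genotypes[j])
            )
    return {b * bin_size: mean(v) for b, v in sorted(bins.items())}

# Toy data (hypothetical): three markers scored in five accessions.
positions = [100_000, 150_000, 2_400_000]
genotypes = [
    [0.0, 0.25, 0.5, 0.75, 1.0],
    [0.0, 0.25, 0.5, 0.75, 1.0],  # identical to the first marker
    [0.5, 0.0, 1.0, 0.0, 0.5],    # uncorrelated with the first two
]
decay = ld_decay(positions, genotypes, bin_size=1_000_000)
```

In this toy case the closely spaced pair falls in the first bin and the two distant pairs fall in the 2 Mb bin, mimicking a decay of r² with distance.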

The genetic structure of the panel was investigated separately using the two marker datasets generated (AFLPs and SSRs scored as dominant markers, and codominant SNPs and indels with AP information) and three different approaches: a discriminant analysis of principal components (DAPC), a principal component analysis (PCA) followed by k-means, and a Bayesian clustering implemented in STRUCTURE. Results are thoroughly described in the Supplementary Results, and Supplementary Table 8 summarizes the allocation of genotypes to the clusters identified in each analysis. Analyses performed with dominant markers identified two to four clusters, depending on the structure analysis employed (Supplementary Figs. 7–10); however, we observed extensive similarities between the groups identified in each method. A similar pattern was observed when the same three structure analyses were performed with codominant markers. Each method resulted in a unique separation of accessions, varying between two and three groups (Supplementary Figs. 11–14), but the clustering obtained by these different analyses was overall coincident. We found, however, that using dominant or codominant markers yielded noticeably different outcomes. Some overlap was observed between clusters identified by the analyses using each set of markers but, overall, groups identified by these analyses shared little resemblance. Additionally, the results from these methods did not present correspondences with those from phenotype-based HCPCs.

Association analyses

FarmCPU

For FarmCPU analyses, we tested matrices obtained from each genetic structure analysis as covariates and ran the models with no covariates. The distribution of the genomic inflation factor λ (Supplementary Fig. 15) was normal (p = 0.975) and no significant differences (p = 0.084) were observed between the inflation of p values of models. Thus, we chose to conduct FarmCPU analyses using no covariates, as this resulted in the median value of λ closest to its theoretical value under the null hypothesis (λ = 1) and in appropriate profiles of inflation of p values as seen in quantile–quantile (Q–Q) plots (Supplementary Fig. 16). Using a Bonferroni-corrected threshold of 0.05, one marker-trait association was detected for symptom severity and five associations were detected for the viral titer estimated by each quantification method—with one marker being mutually associated with both. The percentage of phenotypic variance explained by each marker ranged from 9 to 30% (Supplementary Table 9).
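For readers unfamiliar with the two quantities used in this model selection, the following sketch shows one common way of computing the genomic inflation factor λ and a Bonferroni-corrected threshold. It assumes that each p value is converted to a 1-degree-of-freedom chi-square statistic and that λ is the ratio between the observed median and the expected median under the null; the function names are hypothetical and the code is not the implementation used in the study.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# obtained from the standard normal quantile: (z_0.75)^2.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_inflation(p_values):
    """lambda = median observed chi-square (1 df) / expected median under H0.

    Each p value must lie strictly between 0 and 1; extremely small values
    (below roughly 1e-16) would need a more careful tail computation.
    """
    nd = NormalDist()
    chi2 = [nd.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

# Uniformly spread p values, as expected when no marker is associated.
n = 999
null_p = [(i + 0.5) / n for i in range(n)]
lambda_null = genomic_inflation(null_p)

# Squaring shifts p values toward zero, mimicking an inflated model.
lambda_inflated = genomic_inflation([p ** 2 for p in null_p])

# Threshold for the 662 dominant markers at a family-wise alpha of 0.05.
threshold = bonferroni_threshold(0.05, 662)
```

A well-calibrated model yields λ close to 1, whereas values well above 1 suggest that p values are inflated, for example by uncorrected population structure.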

Mixed modeling

Twelve combinations of population structure (Q) and kinship (K) matrices were tested as effects in the codominant association models. The distribution of λ in each Q + K combination (Supplementary Fig. 17) was not normal (p = 3.253e−06) and no significant differences (p = 0.869) were detected between models. Thus, subsequent analyses were conducted with a Q + K combination that resulted in the median value of λ closest to 1, which was obtained with the combination of the first three PCs from a PCA with both the realized relationship (MM^T) and pseudodiploid kinship matrices. As the MM^T matrix is directly computed by the GWASpoly package, we considered the Q_PCA + K_MM combination to be the most straightforward. Q–Q plots of the association analyses for SCYL symptom severity and SCYLV relative and absolute quantifications can be found in Supplementary Fig. 18; in general, all models showed appropriate inflation of p values.
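As an illustration of what a realized relationship matrix of the MM^T form looks like, the sketch below computes it from a small marker matrix. Centering each marker on its mean and rescaling the result to a mean diagonal of 1 are assumptions made here for clarity; they may not match the exact conventions of GWASpoly, and the function name and toy genotypes are hypothetical.

```python
def realized_relationship(M):
    """K = M M^T for a marker matrix with accessions in rows, markers in columns.

    Assumptions of this sketch: markers are centered on their column means and
    K is rescaled so that its mean diagonal equals 1. At least one marker must
    be polymorphic, otherwise the rescaling divides by zero.
    """
    n, m = len(M), len(M[0])
    col_means = [sum(row[j] for row in M) / n for j in range(m)]
    centered = [[row[j] - col_means[j] for j in range(m)] for row in M]
    K = [
        [sum(a * b for a, b in zip(centered[i], centered[k])) for k in range(n)]
        for i in range(n)
    ]
    mean_diag = sum(K[i][i] for i in range(n)) / n
    return [[value / mean_diag for value in row] for row in K]

# Toy genotypes (hypothetical): three accessions, four markers, allele dosages.
M = [
    [0, 1, 2, 4],
    [1, 1, 0, 3],
    [2, 1, 1, 2],
]
K = realized_relationship(M)
```

Off-diagonal entries of K are larger for accessions sharing similar marker profiles, which is what allows the matrix to absorb relatedness in the mixed model.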

A stringent significance threshold (p < 0.05 corrected by the Bonferroni method) was used to identify 35 nonredundant markers significantly associated with SCYL symptom severity (Fig. 3). Using this correction, no markers were significantly associated with SCYLV titer. In an attempt to establish a less conservative threshold for association analyses of these two traits, we employed the false discovery rate (FDR) for the correction of p values, which resulted in very low significance thresholds and the identification of thousands of associations as significant. Therefore, we ultimately opted to use an arbitrary threshold of p < 0.0001 to determine markers strongly associated with the two quantification traits. This resulted in 13 and 9 markers associated with SCYLV titer determined by relative and absolute quantifications, respectively (Fig. 3); one marker was common to both analyses. Supplementary Table 10 supplies information on all marker-trait associations identified by this approach. For each trait, we observed a redundancy between markers identified as significant by different marker-effect models; this observation was particularly common between the simplex dominant alternative and the diploidized models.
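The three thresholding strategies mentioned in this paragraph can be contrasted with a short sketch. It assumes the Benjamini-Hochberg step-up procedure as the FDR correction, which the text does not specify, and it uses hypothetical function names and toy p values; only the fixed threshold of 0.0001 and the 0.05 significance level come from the text.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests with p below alpha divided by the number of tests."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests declared significant by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value is under its rank-scaled bound.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

def fixed_threshold(p_values, threshold=1e-4):
    """Indices of tests with p below an arbitrary fixed threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]

# Toy p values (hypothetical).
p_values = [0.00001, 0.00008, 0.004, 0.012, 0.03, 0.2, 0.5, 0.9]

hits_bonferroni = bonferroni(p_values)
hits_fdr = benjamini_hochberg(p_values)
hits_fixed = fixed_threshold(p_values)
```

With these toy values the FDR procedure retains more tests than the Bonferroni correction, while the fixed threshold retains the fewest; with real data the ordering depends on the number of markers and on the distribution of p values.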

Machine learning coupled with feature selection

As a last marker-trait association method, we tested eight ML algorithms for predicting the attribution of genotypes to the phenotypic clusters identified in the HCPCs. When assessing their potential in this task using the full marker dataset, predictive accuracies varied greatly depending on the method and phenotypic groups under analysis. Accuracies were lower for the prediction of clusters associated with viral titer (Q), ranging between 39.2 and 49.6%, with an average of 44.5% (Supplementary Fig. 19a). For clusters identified including symptom severity data (SQ), accuracies were overall higher, albeit varying even more and being still unsatisfactory; they ranged between 7.9 and 73.9% (Supplementary Fig. 19b) and had an average of 58%. Therefore, we tested applying five FS methods to reduce the marker dataset, and constructed three additional reduced marker datasets consisting of intersections between FS methods.

These procedures led to considerably higher accuracies in predicting Q and SQ clusters. Three FS methods (FS1, FS2 and FS4) presented notably superior effects in increasing accuracy in both cases (Supplementary Fig. 20). In the two scenarios, the most accurate model-FS combination was a multilayer perceptron neural network (MLP) coupled with FS2, which was composed of 232 markers for Q and 170 markers for SQ. This combination resulted in average accuracies of 97.6% and 96.5% for the prediction of Q and SQ, respectively (Supplementary Tables 11 and 12). However, in both scenarios, MLP achieved the second-best results when using Inter2 datasets, composed of markers present in at least two out of the three best FS methods, which represented 190 markers for Q and 120 markers for SQ. With this strategy, we could achieve equally high accuracies (95.7% for Q and 95.4% for SQ) with further reductions in marker numbers. To further evaluate the performance of MLP, we produced receiver operating characteristic (ROC) curves and calculated their respective areas under the curve (AUCs). Prior to FS, MLP did not present satisfactory results, with ROC curves very close to the chance level and AUCs of 0.45–0.61 for Q and 0.40–0.56 for SQ (Fig. 4a). When Inter2 was used, ROC curves showed much better model performances, with AUCs of 1.00 for Q and of 0.98–1.00 for SQ (Fig. 4b). These results confirm that Inter2 markers are in fact associated with SCYLV resistance and that MLP is an appropriate model to predict clustering based on this dataset. The markers representing the reduced datasets associated with Q and SQ clusters can be found in Supplementary Tables 13 and 14, respectively. We observed twelve marker overlaps between the two datasets; interestingly, several of these markers were also identified as associated with phenotypes in the FarmCPU and mixed modeling analyses.
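The construction of an Inter2-style dataset and the computation of an AUC can be sketched as follows. The marker names, scores and function names are hypothetical; the AUC is computed with the rank-based (Mann-Whitney) formulation for a single binary contrast, so for three clusters it would be applied once per cluster in a one-vs-rest fashion, which is an assumption about how the per-cluster curves were obtained.

```python
from collections import Counter

def inter2(*selected_sets, min_methods=2):
    """Markers selected by at least `min_methods` feature selection methods."""
    counts = Counter(marker for s in selected_sets for marker in set(s))
    return {marker for marker, c in counts.items() if c >= min_methods}

def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = in cluster, 0 = not).

    Equals the probability that a randomly chosen positive receives a higher
    score than a randomly chosen negative; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical marker sets returned by the three best FS methods.
fs1 = {"m1", "m2", "m3"}
fs2 = {"m2", "m3", "m4"}
fs4 = {"m3", "m5"}
reduced = inter2(fs1, fs2, fs4)

# Hypothetical predicted probabilities of belonging to one cluster.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance = auc([0.5, 0.5], [1, 0])
```

An AUC of 1.00 means that every accession of the cluster is ranked above every accession outside it, while values near 0.5 correspond to the chance level.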

Marker mapping and annotation

For a better visualization of the physical location of all markers associated with SCYLV resistance, we constructed a map of their distribution along S. spontaneum’s “A” chromosomes (Fig. 5), in which we also included markers identified as associated with SCYLV resistance in previous mapping studies. Overall, markers were considerably spread along chromosomes; however, we observed regions of dense concentration of markers identified by various methods, such as the long arms of chromosomes 1 and 3. We also verified the proximity between several markers identified in the present work and by other authors, indicating their convergence and the reliability of the methods employed here.

Out of the 362 nonredundant markers associated with all phenotypes, 176 were located in genic regions and could be annotated by aligning their 2000-bp neighboring regions with the coding sequences (CDSs) of 14 Poaceae species and Arabidopsis thaliana genomes; Supplementary Table 15 contains data on the alignment with the highest percentage of identity for each marker. In some cases, where two or more markers were closely located, coincident alignments and annotations were obtained; consequently, 148 genes were representative of all the best alignments. The large majority of top-scoring alignments (117) occurred with CDSs of Sorghum bicolor, the phylogenetically closest species among those used for alignment. Fewer alignments also occurred with the CDSs of other species. Several of the annotated genes could be associated with plant resistance to viruses, as detailed in the discussion.

Discussion

We evaluated the severity of SCYL symptoms and SCYLV titer in a panel of 97 sugarcane accessions. These two traits are of great concern to breeding, as both have been associated with higher yield losses in SCYLV-infected sugarcane plants^18,22,35. Prior to phenotyping, plants were subjected to high and uniform SCYLV inoculum pressure, an innovation over all previous SCYLV genetic mapping studies^{27,28,29,30,31}, which relied on natural infection under field conditions. Using RT-qPCR, currently regarded as the most precise method for SCYLV quantification¹⁸, we assessed the viral titer in these genotypes. We found a strong and positive correlation between the BLUPs calculated for the SCYLV titers obtained by the two quantification methods employed, showing the consistency of the data. The absence of a perfect correlation might have arisen from intrinsic differences between methods, which have been responsible for disparities in viral quantification by RT-qPCR in other plant-virus interactions³⁶.

However, we observed no quantitative correlation between the severity of SCYL symptoms and SCYLV titers across the sugarcane genotypes analyzed. This finding corroborates a growing body of evidence suggesting that these traits are not strongly or necessarily correlated, i.e., high SCYLV titers are not a guarantee of more severe yellowing or of its development at all^37,38,39. This reinforces the importance of SCYLV molecular screening of sugarcane clones by breeding programs, in an effort to avoid the employment of genotypes that accumulate high viral loads asymptomatically but may inconspicuously suffer yield losses as well as serving as a virus reservoir for vector transmission to other susceptible genotypes.

To further explore this issue, we performed two HCPC analyses to discriminate accessions based on their response to SCYLV, which led to the separation of clusters with considerable phenotypic differences. In the first HCPC, using only viral quantification data, we could discern groups with significant variation in viral titers. In the second analysis, which also included symptom severity data, clusters with even more contrasting responses to SCYLV could be discriminated. Cooper and Jones⁴⁰ proposed a terminology addressing plant responses to viral infections that is still employed today^41,42,43. According to this proposal, once infected, plants present differences in their ability to restrict viral replication and invasion; the extremes of a spectrum of behaviors are plants termed susceptible and resistant. Additionally, they may also respond differently to the infection in terms of symptom development: another spectrum exists, at the extremes of which are sensitive and tolerant plants. In view of this nomenclature, we propose that the clusters identified in this second HCPC be described as follows: (SQ1) resistant, for sugarcane genotypes distinguished by low SCYLV titer and mild or no SCYL symptoms; (SQ2) tolerant, for genotypes that, despite exhibiting higher viral titers, presented few or no disease symptoms; and (SQ3) susceptible, for genotypes with the most severe symptoms and presenting high viral titers. This classification per se is of great use in sugarcane breeding, as it distinguishes not only sources of tolerance to SCYLV but also an exceptionally promising group of truly resistant genotypes.

Our main objective was, however, to identify markers associated with SCYLV resistance in a broader sense. With this aim, we performed genotyping with a combination of dominant and codominant markers, which has never been described for sugarcane. We evaluated the impact of using genomic references from various backgrounds in variant calling from GBS. In previous sugarcane GWASs, this was performed using the genome of S. bicolor^31,44,45,46, a close relative species with a well-assembled and annotated genome. However, in our analyses, this reference yielded a number of markers considerably inferior to other references. The methyl-filtered genome of the SP70-1143 cultivar yielded the most markers, in agreement with a previous study employing GBS⁴⁷; this is a plausible outcome, as this method avoids sampling of methylated regions⁴⁸ which were also filtered out for this genomic assembly⁴⁹. However, to choose the best reference for further analyses, we also considered the quality of the assembly, which greatly affects the results of GWASs in polyploids⁵⁰. The best-assembled sugarcane genome available to date is the allele-defined genome of a haploid S. spontaneum accession⁵¹. Despite presenting one of the highest total tag alignment rates, this reference also gave a very high rate of multiple alignments, leading to the identification of relatively few markers. This was probably due to the alignment of tags to hom(e)ologous regions of different alleles rather than to the duplicated regions that we intended to avoid. To circumvent this situation, we conducted our analyses with markers isolated using a monoploid chromosome set obtained from this genome, which provided a large number of markers with reliable position information.

Using these codominant markers, we analyzed the decay of LD over distance. LD has long been hypothesized to be high in sugarcane due to the short breeding history and narrow genetic base of this crop; many studies using dominant markers have estimated it to be especially high at 5–10 cM^{52,53,54,55,56}. The first study to use SNPs for this task and estimate LD decay in bp⁵⁷ indicated that LD was extremely long lasting, with the average r² decaying to 0.2 at 3.5 Mb in hybrids. Our results further confirm the persistence of LD at long distances in sugarcane, albeit indicating that it decayed more quickly, with r² dropping to 0.2 at less than 1 Mb and to 0.1 at 2 Mb. These results impact mapping studies, as a high LD implies that a relatively low density of markers may suffice for accurate mapping of quantitative traits.

We tested several approaches to evaluate population structure in the panel using each distinct marker dataset generated, which yielded remarkably different results. Studies contrasting the usage of dominant and codominant markers in plants have shown discrepancies in measures of genetic structure and diversity^58,59,60, but this sort of comparison has never been performed including markers with dosage information in polyploids—let alone in sugarcane. In this crop, the most relatable findings available are those reported by Creste et al.⁶¹, who showed that using different dominant markers can bias genetic analyses, and thus the choice of marker must be guided by the specific goal of each study. For GWASs—for which a high density of markers is usually necessary—SNPs and indels are currently more cost-effective, as they can be easily identified in much larger numbers, in addition to offering the possibility of estimating highly-informative allele dosages or APs^62,63,64. Hence, we believe the results we obtained with codominant SNPs and indels are more reliable, as they lean on much more genetic information.

In contrast with the differences arising from the type of marker used, we observed little divergence between results of different structure methods performed with each marker dataset, and eventual discrepancies did not result in significant differences in the inflation of the association models, whose patterns were similar to those of previous studies^31,45,46,56. Therefore, we opted to perform association analyses using the covariates that resulted in the value of λ closest to 1. For FarmCPU, this corresponded to the “naive” model with no covariates; for codominant mixed modeling analysis, this was the Q_PCA + K_MM combination. K_MM is the usual choice of relationship matrix in polyploid association mapping^65,66,67, while Q matrices obtained from PCA are commonly used to control population structure in GWASs^68,69,70.

FarmCPU analyses using dominant markers identified one AFLP fragment significantly associated with symptom severity, which explained a small part of the phenotypic variation (r² = 0.116). Eight out of the nine markers associated with viral titer explained larger parts of the variation in the phenotypes (21–30%). These results are more promising than those obtained in a previous dominant GWAS targeting SCYLV resistance, which found r² ranging between 0.09 and 0.14²⁸. Albeit low, values in this range are very common in sugarcane association studies. Evidence indicates that almost all of this crop’s traits are highly quantitative, with the notable exception of brown rust resistance^71,72. For other relevant traits, it is common to find most associated markers explaining ≤ 10% of the phenotypic variation^29,44,56.

A few authors have suggested that these suboptimal results could be improved with the usage of markers with dosage, which was also performed here using SNPs and indels with AP information. Although codominant mixed modeling analyses successfully identified markers associated with SCYL symptom severity using the Bonferroni correction, the same was not observed for SCYLV titer. This was probably influenced by the modest size of the panel, a factor that restricts the power of GWASs^73,74. As previously noted by Racedo et al.⁷⁵, assembling and phenotyping large sugarcane association panels is a challenging task. Thus, it is not uncommon for association studies of this crop to evaluate fewer than 100 genotypes^{44,75,76,77,78}. Our study was particularly burdensome, as extremely laborious inoculation and quantification techniques were employed to generate highly reliable phenotypic data. Furthermore, the Bonferroni method is notorious for its conservative nature, poorly controlling false negatives^79,80,81. This led us to establish an arbitrary threshold (p < 0.0001) to select markers strongly associated with SCYLV titer for further investigation. Using this methodology, we identified 57 nonredundant markers associated with the three phenotypes.

As a last approach to identify marker-trait associations, we tested several ML algorithms coupled with FS methods to predict genotype attribution to phenotypic clusters identified by HCPC analyses. Unlike methods built on classical statistics, these algorithms are not as heavily impacted by the sample size. We could achieve very high accuracies of prediction (up to 95%) with considerably reduced datasets comprising 120–190 markers. These results are very similar to what was obtained for predicting sugarcane brown rust resistance groups, where an accuracy of 95% was obtained using 131 SNPs⁶⁴. Marker datasets selected by ML have rarely been employed in genetic association studies in plants, but the few existing examples show their power to identify genes associated with phenotypes of interest^82,83,84.

We annotated 176 markers associated with SCYLV resistance to 148 genes. Many candidates do not allow extensive discussion on their involvement in resistance to this disease, as they either have very generic descriptions or have not been previously linked to plant virus resistance. Other proteins have occasionally been associated with responses to viruses but are members of very large gene families with extremely diverse biological roles and will not be discussed. Remarkably, a few candidates encode proteins previously associated with the response to SCYLV infection. This was the case for SbRio.10G317500.1, encoding a peroxidase precursor. Peroxidases have long been known to be activated in response to pathogens, but most notably, a guaiacol peroxidase has been shown to be more active in sugarcane plants exhibiting SCYL symptoms than in uninfected or asymptomatic plants⁸⁵. Our results provide further evidence that these enzymes are in fact involved in the response to SCYLV. Other candidates harboring markers associated with SCYLV resistance encode proteins with motifs previously associated with SCYLV resistance³¹: Sobic.001G023900, encoding a GATA zinc finger protein, and Sobic.001G200200 and Zm00001d037864_T030, both of which encode proteins containing tetratricopeptide repeats.

Other annotations included classic participants in more general disease resistance mechanisms, such as several genes encoding proteins with leucine-rich repeat (LRR) motifs. These structures are part of nucleotide-binding site LRR (NBS-LRR) proteins, receptors that detect pathogen-associated proteins and elicit effector-triggered immunity⁸⁶. Hence, NBS-LRRs have been widely shown to determine resistance to viruses in plants^87,88,89. We found two LRR proteins (Sobic.008G156600.1 and Sobic.001G452600.1), one disease resistance NBS-LRR (Sobic.007G085400.1) and one N-terminal leucine zipper NBS-LRR resistance gene analog (Sobic.005G203500.1) associated with SCYLV resistance. Furthermore, we annotated one gene (Sobic.009G204800.1) that encodes a precursor of a receptor-like serine/threonine–protein kinase within the family to which LRR proteins belong. Yang et al.³¹ also identified a serine/threonine-protein kinase associated with SCYLV resistance. We consider these proteins highly promising candidates to be involved in the recognition of infection by SCYLV, which could trigger response mechanisms leading to the restriction of the virus. Further virus–host interaction studies involving these proteins might help confirm this hypothesis, which would represent a major breakthrough in understanding resistance to SCYLV.

Two other annotated genes were readily identified as involved in plant disease resistance mechanisms. Sobic.010G131300.2 contains a Bric-a-Brac, Tramtrack, Broad Complex/Pox virus and Zinc finger (BTB/POZ) domain, while Sobic.007G198400.1 contains two BTB domains, as well as ankyrin repeat regions. These domains are present in and are essential for the function of NONEXPRESSOR OF PATHOGENESIS-RELATED GENES 1 (NPR1), a central player in plant disease responses^90,91. This family of transcription factors is involved in establishing both systemic acquired resistance and induced systemic resistance⁹², mediating the crosstalk between salicylic acid and jasmonic acid/ethylene responses⁹³. Correspondingly, NPR1 has been widely shown to be involved in resistance to viruses^94,95, and it is therefore reasonable to suggest its participation in the response to infection by SCYLV.

We also found a few candidates with putative roles in the RNA interference mechanism, one of the most prominent processes that contribute to resistance against viruses in plants. This is the case for Sobic.001G214000.1, which encodes a Dicer. Dicers are part of a mechanism known as RNA silencing, recognizing and cleaving long double-stranded RNA molecules into mature small RNAs that guide the cleavage of viral mRNAs and disrupt virus replication⁹⁶; accordingly, they have been linked to resistance to viruses in several plant species^97,98. Another gene possibly involved in RNA interference is Sobic.009G121100, encoding a protein related to the binding of calmodulin, a calcium transducer that regulates the activity of various proteins with diverse functions⁹⁹ and has been widely implicated in viral resistance in plants, often playing roles in RNA interference^100,101,102. Consequently, we consider these genes promising candidates in the regulation of SCYLV replication and spread in planta, as well as in the development of SCYL symptoms.

Two additional annotations linked to the mechanism of RNA interference are those of genes encoding proteins with F-box domains, SbRio.03G158900 and Sobic.002G019750.1. F-box proteins are involved in virus resistance in several plant species^103,104. A particularly interesting case is FBW2 from Arabidopsis thaliana, which regulates AGO1, an Argonaute protein with a central role in RNA silencing¹⁰⁵ and repression of target viral RNAs^106,107,108. Even more intriguing is the fact that one of the proteins encoded by the SCYLV genome, P0, contains an F-box-like domain and mediates the destabilization of AGO1, leading to the suppression of host gene silencing¹⁰⁹. Whether the F-box proteins identified here play active roles in silencing of SCYLV remains a question to be investigated by further studies.

Other annotated genes may represent host factors involved in various steps of plant–virus interactions. For instance, Sobic.010G160500.4 encodes an RNA helicase with a DEAD-box domain, which is often coopted by viruses to promote viral translation or replication, thus playing important roles in regulating infection^110,111,112. Similarly, soluble N-ethylmaleimide-sensitive-factor attachment protein receptor (SNARE) proteins such as Sobic.001G528000.1 are essential in the biogenesis and fusion of vesicles of several plant viruses^113,114,115,116. We also found one gene encoding a myosin (Sobic.002G108000.1) and two genes related to kinesin (Sobic.001G346600.1 and Sobic.001G399200.2), all filament-associated motor proteins involved in the transport of organelles¹¹⁷. In a few cases, both myosins^118,119,120 and kinesins¹²¹ have been shown to be involved in viral intercellular movement through poorly understood mechanisms. One last interesting annotation was Sobic.003G101500.1, a protein with a DNAJ domain. DNAJs have been shown to interact with proteins of various plant viruses and to be associated with resistance, sometimes being crucial for virus infection and spread^122,123,124,125. We consider these genes to be promising candidates as host cofactors in the response to SCYLV infection.

In conclusion, this array of genome-wide analyses allowed us to detect markers significantly associated with SCYLV resistance in sugarcane. If validated, these markers represent an especially valuable resource for sugarcane breeding programs, as the results can be directly employed in marker-assisted strategies for the early selection of clones. The annotation of several genes wherein these markers are located revealed many candidates with long-established and pivotal roles in viral disease resistance, further demonstrating the efficiency of the methods employed for this purpose. Additionally, this annotation provides valuable insights into the unexplored mechanisms possibly involved in sugarcane’s response to infection by SCYLV, introducing new candidates whose role in this process can be further investigated in future studies.

Material and methods

Plant material and inoculation

The plant material and inoculation methods employed in the present study are described by Burbano et al.¹²⁶ and are in compliance with local and national regulations. The experimental population consisted of a panel of 97 sugarcane genotypes comprising wild germplasm accessions of S. officinarum, S. spontaneum and Saccharum robustum; traditional sugarcane and energy cane clones; and commercial cultivars originating from Brazilian breeding programs (Supplementary Table 1). To ensure plant infection with SCYLV, a field nursery was established in March 2016 at the Advanced Centre for Technological Research in Sugarcane Agribusiness located in Ribeirão Preto, São Paulo, Brazil (4°52′34″ W, 21°12′50″ S). Seedlings from sprouted setts of each genotype were planted in 1-m plots with an interplot spacing of 1.5 m. The cultivar SP71-6163, which is highly susceptible to SCYLV¹⁵, was interspersed with the panel genotypes. M. sacchari vector aphids were reared on RT-PCR-tested, SCYLV-infected SP71-6163 plants. After an acquisition access period of at least 48 h, aphids were released weekly in the field nursery in July 2016. In May 2017, setts obtained from this nursery after plant growth were used to establish a field experiment following a randomized complete block design with three blocks. Plants were grown in 1-m-long three-row plots with row-to-row and interplot spacings of 1.5 and 2 m, respectively. Each row contained two plants, totaling six plants of each genotype per plot. To further assist infection by SCYLV, the cultivar SP71-6163 was planted in the borders and between blocks, and M. sacchari aphids were again released in the field weekly for 5 months, starting from November 2017.
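
As an illustration only, the randomized complete block layout described above can be sketched in Python. The text does not state how the randomization was generated, so the function name `rcbd_layout`, the genotype labels and the seed are assumptions; the border and between-block rows of SP71-6163 are not modeled.

```python
import random


def rcbd_layout(genotypes, n_blocks=3, seed=0):
    """Return one independently shuffled plot order per block.

    In a randomized complete block design, every genotype appears
    exactly once in each block, in a random position.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    layout = []
    for _ in range(n_blocks):
        order = list(genotypes)  # copy so each block is shuffled separately
        rng.shuffle(order)
        layout.append(order)
    return layout


# Hypothetical labels standing in for the 97 panel genotypes.
genotypes = [f"G{i:02d}" for i in range(1, 98)]
layout = rcbd_layout(genotypes)

# Plot arithmetic taken from the description: three rows of two plants.
rows_per_plot, plants_per_row = 3, 2
plants_per_plot = rows_per_plot * plants_per_row
total_plots = len(layout) * len(genotypes)
```

Under these assumptions, each of the three blocks holds one plot per genotype, and each plot holds six plants.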

Phenotyping

Plants were phenotyped in two crop seasons: plant cane in June 2018 and ratoon cane in July 2019. The severity of SCYL symptoms was assessed by three independent evaluators, who classified the top visible dewlap leaves (TVDLs) of each plot using a diagrammatic scale established by Burbano et al.¹²⁶, as shown in Supplementary Fig. 1. In the same week as symptom evaluation was performed, fragments from the median region of at least one TVDL per plot were collected and stored at −80 °C until processing. Total RNA was extracted from this tissue using TRIzol (Invitrogen, Carlsbad, USA). Samples were subjected to an additional purification process consisting of three steps: (1) mixing equal volumes of RNA extract and chloroform, (2) precipitating the RNA overnight with 2.5 volumes of 100% ethanol and (3) a conventional cleaning step with 70% ethanol. RNA was then quantified on a NanoDrop 2000 spectrophotometer (Thermo Scientific, Waltham, USA) and subjected to electrophoresis on a 1% agarose gel stained with ethidium bromide for integrity checks. Samples were next diluted, treated with RNase-Free RQ1 DNase (Promega, Madison, USA), quantified and diluted again for standardization, and converted to cDNA using the ImProm-II Reverse Transcription System kit (Promega, Madison, USA).

The SCYLV titer in each sample was determined by qPCR using GoTaq qPCR Master Mix (Promega, Madison, USA) on a Bio-Rad CFX384 Touch detection system (Bio-Rad, Philadelphia, USA). Two viral quantification methodologies were employed, one relative and one absolute, using primers and conditions as described by Chinnaraja and Viswanathan¹²⁷. For both methods, a set of primers was used to amplify a 181-bp fragment from SCYLV ORF3 (YLSRT). For the relative quantification, an additional set of primers was used to amplify a 156-bp fragment of the 25S subunit of sugarcane ribosomal RNA (25SrRNA), used as an internal control. The 2^−ΔΔCT method¹²⁸ was used to correct cycle threshold (CT) values; the sample with the highest CT and a melting temperature of 82.5 ± 0.5 °C for the YLSRT primers was used as a control for phenotyping in each year. The absolute quantification followed the methodology described by Chinnaraja et al.³⁹. A pGEM-T Easy vector (Promega, Madison, USA) carrying a cloned 450-bp fragment from SCYLV ORF3, previously amplified by RT-PCR, was used to construct a serial dilution curve with six points and tenfold dilutions between points, which were amplified on qPCR plates. All reactions were performed using three technical replicates.
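
The two quantification calculations can be illustrated with a minimal Python sketch, assuming that mean CT values over the three technical replicates are used as input. The averaging step, the function names and all numeric values below are hypothetical and are not taken from this study. The relative method applies 2^−ΔΔCT with 25SrRNA as the reference and a calibrator sample; the absolute method fits a line of CT against log10 of template copies from a six-point tenfold dilution series and inverts it.

```python
import math


def relative_quantity(ct_target, ct_ref, ct_target_cal, ct_ref_cal):
    """Fold difference relative to a calibrator by the 2^-ddCT method."""
    delta_sample = ct_target - ct_ref        # dCT of the sample
    delta_cal = ct_target_cal - ct_ref_cal   # dCT of the calibrator
    return 2.0 ** -(delta_sample - delta_cal)


def fit_standard_curve(log10_copies, cts):
    """Ordinary least-squares fit of CT = slope * log10(copies) + intercept."""
    n = len(cts)
    mean_x = sum(log10_copies) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log10_copies, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template copies from a CT."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical six-point, tenfold dilution series (copies per reaction).
dilution_copies = [1e2, 1e3, 1e4, 1e5, 1e6, 1e7]
dilution_cts = [33.0, 29.7, 26.4, 23.1, 19.8, 16.5]  # made-up CT values
slope, intercept = fit_standard_curve(
    [math.log10(c) for c in dilution_copies], dilution_cts
)
estimated_copies = copies_from_ct(26.4, slope, intercept)

# Hypothetical sample: YLSRT CT 22, 25SrRNA CT 12; calibrator: 30 and 12.
fold = relative_quantity(22.0, 12.0, 30.0, 12.0)
```

Choosing the sample with the highest YLSRT CT as the calibrator, as described above, means that every other sample is reported as a fold difference of at least roughly one relative to the least-infected sample.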