Introduction

Canola (also known as rapeseed, Brassica napus L., genome AACC, 2n = 38) is one of the most important sources of edible oil providing 14% of global production1. Strong canola growth demand requires critical increases in overall production and productivity, which can in part be achieved through overcoming biotic and abiotic stress. One of the major biotic stressors is blackleg disease, which makes it a high priority trait for canola breeding programs. It is caused by the fungal pathogen Lepthosphaeria maculans resulting in large yield losses, thereby becoming a major threat to world canola production. Most canola growing regions experience losses due to blackleg disease, with Canada and France having reported between 5 and 40% average yield losses annually2,3. Similar yield reduction is common in Australia, however, in some regions losses as severe as 90% have been observed4,5,6. Genetic resistance to blackleg can be either qualitative (race specific) or quantitative (non-race specific) and commercial cultivars tend to have qualitative resistance genes that, when effective, confer complete immunity against blackleg. However, quantitative resistance has the potential to only restrict the development of L. maculans formation throughout plant development stages7,8,9,10.

Elucidating the genetic basis of blackleg resistance as well as identification and localization of resistance genes/genomic regions to L. maculans is important to understand the genetic architecture of resistance to the pathogen and is an important step to mitigate its catastrophic effect. Early studies focused on bi-parental mapping of quantitative trait loci (QTL) using RFLP, SSR, AFLP, RAPD or DArT markers for blackleg disease resistance11,12,13,14,15,16,17,18,19,20,21,22,23. In addition, several environmentally stable QTL were identified and localized in A and C sub-genomes conferring resistance to L. maculans24,25,26. Moreover, assembling pangenomes has been applied to discover novel genes in various species27,28. A study by Hurgobin et al.29 pointed out that one of the causes for gene absence or presence in B. napus was homoeologous exchange and it contributed towards resistance gene diversity.

In the past decade, SNPs have become the genetic marker of choice over multi-allelic markers due to their abundance in the genome and high polymorphism rate30. Raman et al.31 used SNP-chip based linkage maps and found a major L. maculans resistance locus on chromosome A07 that accounted for up to 69.2% of genetic variation and was mapped around the major locus Rlm4. While the genetic variance explained with bi-parental populations is often estimated to be high, these loci explain much less of the variation generally observed in more diverse germplasm. Fikere et al.32 demonstrated that a large proportion of genetic variation for blackleg disease resistance in diverse canola population remains unexplained by currently published QTL studies. Nevertheless, bi-parental QTL mapping experiments were an important first step towards increasing the understanding of blackleg disease architecture in canola.

The development of advanced sequencing and SNP chip technologies, and the associated drop in genotyping prices, has enabled the use of thousands of SNP markers across the genome. The increased SNP density allows for the application of single SNP regressions or genome-wide association studies (GWAS) in more diverse germplasm. Genome-wide association studies have rapidly increased the discovery of new putative candidate genes for complex traits in a variety of crops such as rice33, maize34, soybean35, and flax36,37. In canola, several GWAS have been reported for agronomic and quality traits. Few GWAS have been reported on major canola diseases (Leptosphaeria maculans and Sclerotinia stem rot)22,38,39. Most recently, Raman et al.31 used a panel of 179 diverse canola lines genotyped with 18,804 SNPs to perform a GWAS that discovered a new L. maculans resistance gene Rlm12 on chromosome A01 as well as additional QTL.

We investigated two main ways to increase the power of GWAS using (1) meta-analysis of single trait GWAS to identify pleiotropic loci and (2) imputed whole-genome sequence (WGS) data. A meta-analysis GWAS combines single trait results from different traits or different environments40,41. This would increase the power to uncover pleiotropic loci affecting multiple traits. The increased power has been demonstrated in several studies in human and the meta-analyses have been applied widely in human and animal genetics as they avoid sharing of original genotypes and phenotypes40,41,42,43,44.

The use of WGS variants substantially increases the number of markers and thus increases linkage disequilibrium (LD) between markers and causative mutations. Furthermore, the causative mutations may themselves be genotyped directly. This is proposed to lead to more effective pinpointing of genes or genomic regions linked with complex traits of interest. However, WGS for a large number of individuals is expensive. A more cost-effective approach is to impute WGS into individuals that have been genotyped with a lower density assay. A number of imputation software programs are available, such as MACH45, IMPUTE46, BEAGLE47, FImpute48, and Eagle49 + Minimac50, and they vary in computational efficiency and imputation accuracies51,52. Imputation has been implemented in several crops, a study by Torkamaneh et al.53 achieved high accuracy when imputing from low-coverage genotyping-by-sequencing (GBS) to WGS in soybean. Similarly, Shi et al.54 examined the potential of exome sequence imputation in wheat. These studies emphasized the need to carefully evaluate imputation algorithms in new datasets as large differences in performance exist.

Here we report on an imputation analysis from transcriptome GBSt to WGS and GWAS in a relatively large population of spring and winter canola phenotyped across three growing seasons. We impute WGS into 585 canola lines genotyped with GBSt and analyse this dataset with single-trait GWAS with the aim of: (i) comparing the power of GWAS using GBSt and imputed WGS and (ii) performing meta-analysis of GWAS for two blackleg resistance traits (survival and internal infection) and identify genomic regions potentially harbouring new resistance genes associated with blackleg traits as well as compare its power to single-trait GWAS.

Results

Phenotypic variation and correlation coefficient

In-field phenotypes for 585 diverse canola lines in field trials from 2015 to 2017 were adjusted for spatial variation and best linear unbiased estimates (BLUEs) were calculated per trial and per line (Table 1). Performance of emergence and internal infection varied across the 2015–2017 growing seasons. For instance, in the 2015 trial, seedling emergence counts were substantially lower at Green Lake than Wickliffe, possibly due to drier conditions at Green Lake. Blackleg traits (survival rate and average internal infection) also varied across sites. As expected, in the blackleg managed 2016 and 2017 trials average internal infection percentages were lower (range 7.22–19.2) than in the 2015 blackleg disease nurseries. Broad sense heritability of the three traits ranged from 0.38 (emergence count) at Mininera to 0.80 (survival rate) at the Wickliffe site (Table 1).

Table 1 Phenotypic summary after spatial adjustment of the three traits in the association panel across environments during 2015–2017 growing seasons for 585 winter and spring lines in 2015, and 168 spring lines grown in 2016 and 2017.

The phenotypic correlations of BLUEs between three traits at 5 locations (Wickliffe, Green Lake, Mininera, Horsham irrigated 2016 and 2017, and Horsham rain-fed) in three years (2015, 2016, 2017) are shown in Supplementary Fig. S1. Moderately positive correlation coefficients were observed between similar traits at different sites and years. A strong negative correlation coefficient was recorded between average internal infection and survival rate (r =  − 0.86), following the expected trend of high infection being associated with low survival (Supplementary Fig. S1).

Genomic data and imputation accuracy

We evaluated the imputation accuracy (correlation of observed and imputed genotypes) and concordance rate of WGS in 153 spring and winter type canola lines (Supplementary Table S1a). The low-density transcriptome genotyping-by-sequencing (GBSt) of these 153 lines consisted of 64,072 SNPs that overlapped with the 6 million WGS variants. Testing included FImpute and Eagle + Minimac. The correlation between observed and imputed genotypes was considerably higher in Eagle + Minimac3 compared to FImpute in all the sub-population validation scenarios (Supplementary Table S1a). For instance, the correlations for Eagle + Minimac3 imputed spring lines were 0.78 when both spring + winter and spring only reference sets were used, whereas with FImpute they were 0.68 and 0.71, respectively. The accuracy of the imputation followed the same trend, being higher for Eagle + Minimac3, compared to FImpute, with the exception of the imputed spring lines from the spring reference set (0.84 vs 0.87; Supplementary Table S1a). A decline in the accuracy of imputation was observed for low-frequency SNPs in both imputation algorithms. This decline was less pronounced when using Eagle + Minimac3 (Supplementary Fig. S2). Based on these results we used Eagle + Minimac3 to impute all lines from 64,072 to 6 million SNPs.

A total of 585 individuals genotyped at 64,072 SNP were imputed to WGS using 153 individuals with 6 million SNPs as a reference set. Sporadic missing genotypes were imputed using Beagle47 in both datasets followed by phasing with Eagle, which in turn was followed by the actual imputation using Minimac50. Based on the Minimac results, we set a Rsqr (squared correlation between imputed and posterior genotypes) threshold of > 0.1, which dropped the final SNP density to 1,234,708 (Supplementary Table S1b). Supplementary Table S1b also shows that other slightly altered pipelines resulted in lower Rsqr values. Imputation accuracy and the proportion of correctly imputed SNP markers (accuracy) are presented in Supplementary Table S1b. Beagle + Eagle + Minimac was chosen to impute whole-genome sequence for GWAS.

The genetic relatedness between 585 canola lines was investigated with a genomic relationship matrix as described in Yang et al.55. The population were clustered into two main categories (Fig. 1). The larger cluster defined the winter lines followed by the spring set. A few spring lines with winter background formed a separate cluster, confirming that these populations were originally derived from winter lines.

Figure 1
figure 1

Heat map of the genomic relationship matrix for 585 diverse canola lines using the imputed 1,234,708 SNP markers. Mixed = spring canola lines with winter background; darker colour indicates greater relatedness. Figure produced in R3.6.

Single trait genome-wide association study and false discovery rate

Manhattan plots of single-trait GWAS using WGS and GBSt for emergence count, survival rate, and average internal infection at Wickliffe and Green Lake sites are shown in Supplementary Figs. S3abc, S4abc and S5 and QQ plots indicated that population structure was properly accounted for (Supplementary Fig. S6). Using WGS, a considerably larger number of significant SNPs than for the GBSt GWAS were detected at all P-value thresholds and for all traits (Table 2). In general, similar peaks were identified for GBSt and WGS, but the WGS GWAS revealed additional peaks (Fig. 2a vs b). Furthermore, we repeated the WGS GWAS with hardcoded genotypes (most probable genotype coded as 0, 1, and 2), which generally identified a similar number of significant SNP at \(P \le 1 \times 10^{ - 4}\) and fewer significant SNP at \(P \le 1 \times 10^{ - 5}\) with higher FDR. The general trend of power based on FDR was, therefore, WGS dosage > WGS hardcoded > GBSt dosage (Fig. 3).

Table 2 Comparison of GBSt versus WGS GWAS as well as two alternative ways to code genotypes (dosage and hardcoded) for emergence (EME), average internal infection (AvInf), and survival (Surv) at Wickliffe (WL) and Green Lake (GL) sites in 2015, where FDR is the false discovery rate.
Figure 2
figure 2

Increased power of WGS and meta-analysis of GWAS for internal infection as demonstrated by Manhattan plots for (a) transcriptomic genotyping-by-sequence (GBSt) at Wickliffe and (b) imputed whole-genome sequence (WGS) at Wickliffe and (c) multi-trait meta-analysis of GWAS for internal infection and survival at the two 2015 blackleg trials. The colour bar shows SNP density every 1Mbp. Figure produced in R3.6 using CMplot function (https://github.com/YinLiLin/R-CMplot).

Figure 3
figure 3

False discovery rate (at \(P < 1 \times 10^{ - 4}\)) for GWAS using GBSt dosage, WGS dosage, WGS hard coded genotypes based on blackleg disease prone site (Wickliffe) and meta-analysis for blackleg disease traits (internal infection and survival). Figure produced in R3.6.

Meta-analysis of GWAS

Two meta-analyses were carried out combining the single-trait WGS dosage GWAS to increase power and to identify pleiotropic loci. One combined internal infection and survival (blackleg related) phenotypes across the two sites in 2015 (Figs. 2c, 4), and the other combined all traits and sites in that year. For the blackleg related traits, 101 SNPs were significant (\(P < 1 \times 10^{ - 6}\)) with an FDR of 1.22%, which is a considerable improvement when compared to the single-trait WGS GWAS (Figs. 2c, 3; Tables 2, 3). Adding emergence GWAS into the meta-analysis increased the number of significant SNP, however, this trend stopped at \(P < 1 \times 10^{ - 6}\) (Table 3). In addition, we compared a GWAS model that combined BLUEs from different sites to combining separate site GWAS using a meta-analysis (Table 4). The meta-analyses detected more significant SNP at all p-value thresholds.

Figure 4
figure 4

A circular plot showing potential candidate genomic region associated with blackleg traits in Brassica napus L. across A and C sub-genome at \(P < 1 \times 10^{ - 3} ; 10^{-4};10^{ - 5} ;\;{\text{and}}\;10^{ - 6}\) thresholds shown as black, red, yellow and green lines, respectively. The top 50 significant SNPs across the regions are indicated in the circle. Figure produced in R3.6.

Table 3 Meta-analysis for specific to blackleg traits and combined all-traits for the 2015 trials. FDR = False discovery rate.
Table 4 Meta-analysis using a combined best linear unbiased estimates (BLUEs) from the Wickliffe and Green Lake 2015 trials.

Validation and comparison of results from different GWAS strategies

We used the 2016 and 2017 datasets to validate and compare the different GWAS approaches. We implemented various validation strategies (VS) and quantified the most significant SNPs at \(P < 1 \times 10^{ - 5}\) identified in several disovery sets: single-trait GWAS analysis using GBSt (VS1); single-trait GWAS hardcoded (VS2); WGS GWAS within locations; single-trait dosage (VS3); multi-trait meta-analysis for blackleg traits using dosage WGS genotypes (VS4); full linear mixed model to combine locations per trait (VS5); meta-analysis combining 2 locations per trait (VS6); and (Fig. 5; Supplementary Tables S2S7).

Figure 5
figure 5

Mean validation rates (%) of different validation strategies across four P-value thresholds (\(P < 1 \times 10^{ - 2}\); \(P < 1 \times 10^{ - 3}\); \(P < 1 \times 10^{ - 4} ; P < 1 \times 10^{ - 5}\)). Input information in Supplementary Tables S1S6. Validation strategies are VS1) single-GBSt-to-single-GBSt: single-trait GWAS 2015 in dosage GBSt in discovery set and single-trait GWAS in GBSt dosage validation in 2016 and 2017 VS2) single-012-to-single-012: WGS single-trait GWAS 2015 in 012 (integer) in a discovery set and WGS single-trait GWAS in 012 (integer) validation in 2016 and 2017 VS3) single-to-single: single-trait dosage GWAS 2015 in discovery set and single-trait dosage GWAS validation in 2016 and 2017 VS4) meta-to-single: meta-analysis GWAS of blackleg traits in 2015 and validation in single trait GWAS in 2016 and 2017 VS5) CombModel-to-single: Combined model sites per traits GWAS in 2015 and validation in single trait GWAS in 2016 and 2017. VS6) meta-AvInt-to-singleAvInt: meta-analysis for AvInt GWAS in 2015 and validation in single-trait AvInt GWAS in 2016 and 2017. Figure produced in R3.6.

Validation strategy 1 (VS1) involved a single-trait GWAS–GBSt dosage discovery set during 2015 at Wickliffe and Green Lake sites with validation in single-trait GWAS–GBSt dosage in 2016 and 2017 trials for EME and AvInf. We found that a smaller number of significant SNPs validated in VS1 versus VS2, which validated the single-trait GWAS–WGS hardcoded discovery set in single-trait GWAS–hardcoded in the same datasets (Fig. 5). We further extended the validation strategy (VS2) to discover and validate significant SNPs detected in single-trait GWAS–WGS dosage genotypes (VS3) and found VS3 to be superior in its validation rates at all p-value thresholds. Comparing VS2 and VS3 to VS1, we can conclude that imputed WGS GWAS were more powerful than GBSt GWAS in this dataset. A slightly different trend was observed when significant SNPs discovered in blackleg meta-analysis GWAS were validated using single-trait GWAS at 2016 and 2017 sites (VS4). Compared to VS1-3, we observed a higher validation rate for P < 10−2 and P < 10−4 in VS4, but a lower validation rate for P < 10−3 and P < 10−5. This shows that meta-analysis improved the overall validation rate, but when specifically looking at highly significant SNP in both the discovery and validation GWAS, it validated a lower proportion of SNPs. Validation strategies 5 (VS5) and 6 (VS6) tested whether it is better to combine data per trait with a multi-location GWAS model (and fitting location as fixed) versus combining locations with a meta-analysis of single location GWAS, respectively. The validation rates observed clearly indicated that combining locations in a full model (VS5) is preferable to simply performing a meta-analysis of single location GWAS (VS6), which is to some extent expected. It is notable that for P < 10−3, the location full GWAS model (VS5) performed better than the blackleg meta-analysis (VS4).

Validation rate (Vr) calculated as:

$${\text{Vr}} = { }\frac{{{\text{Number}}\,{\text{of}}\,{\text{significant }}\,{\text{NPs}}\,{\text{acheived}}\,{\text{in}}\,{\text{the}}\,{\text{validation}}\,{\text{at}}\,{\text{P}}\,{\text{value}}}}{{{\text{Number}}\,{\text{of}}\,{\text{significant}}\,{\text{SNPs}}\,{\text{in}}\,{\text{the}}\,{\text{discovery}}\,{\text{set}}\,{\text{at}}\,{\text{P}}\,{\text{value}}}}$$

Pinpointing candidate genomic regions identified for resistance to L. maculans

The number of significant SNPs (at \(P < 1 \times 10^{ - 4}\)) in the single-trait GWAS analysis varied between blackleg traits and across experimental sites. In general, a higher number of significant SNPs were found at the Wickliffe site compared to Green Lake, while the number of discoveries for the traits within location was relatively similar.

The meta-analysis of blackleg GWAS discovered a larger number of additional SNPs when compared to the single trait GWAS. The most significant SNPs (P = 9.59E−07) were located on 3685 kb of chromosome A02 followed by ChrA08, ChrC03, and ChrC06 (Fig. 4, Supplementary Tables S8, S9 and S10). To identify potential candidate genes with putative resistance to L. maculans, the 674 significant SNPs obtained from meta-analysis were clustered into separate regions if the distance between SNP was more than 200 kb. A total of 79 potential regions were observed (Supplementary Tables S8 and S10). Furthermore, when we compared within the A and C sub-genomes, 51 genomic regions consisting of 344 significant SNPs were on the A sub-genome, while 330 SNP were within 28 regions on the C sub-genomes. The regions with the most significant SNPs were on chromosome C06 (157 SNP) followed by chromosome A02 (125 SNP) and A09 (64 SNP) and A03 (55 SNP). No significant SNP (P < 1 × 10−5) were detected on C02, and very few significant SNP were observed on C01 and C05 (Supplementary Table S8, Fig. 4). Overall, many of the significant results detected in our study shared similar regions with known Rlm genes and QTLs found in linkage mapping populations.

Candidate gene mining for blackleg disease traits using BLAST

B. napus SNP effect annotations revealed that most significant SNP were intergenic, followed by intronic, synonymous, non-synonymous, stop gained and lost coding functions (Supplementary Tables S9 and S10). We performed a BLASTn against Arabidopsis thaliana for B. napus gene sequences within 30 kb up and downstream of the 674 significantly associated SNP \(\left( {P < 1 \times 10^{ - 5} } \right)\) from the blackleg meta-analysis. The BLAST cut-off e-value threshold was set as e−7. The percentage of identical matches, number of nucleic mismatches, expected e-value and bit score were estimated. We searched the BLASTn results with the following search terms; such as disease, resistance, fungus, pathogen, pathogenesis, stem canker, stress, bacteria and abnormality. Based on this we found potential candidate genes with known biotic and abiotic resistant QTLs associated with A. thaliana accounting for 17 out of the 79 genomic regions, with functions that positively related to disease resistance and plant stress.

Discussion

We have reported GWAS results for seedling emergence and blackleg traits (internal infection and survival) and we have validated and compared the results from different GWAS strategies across multiple locations and years. We conducted single-trait and multi-trait meta-analysis GWAS using WGS and GBSt in a diverse set of canola lines. The imputation to WGS was reasonably accurate and the imputation pipeline using Eagle + Minimac3 seemed to outperform FImpute. WGS-based GWAS were more powerful than using GBSt as indicated by the number of significant SNPs, the FDR, and the validation rates in additional years of field trial data. Meta-analysis identified more significant SNP than single trait GWAS with a similar FDR, but a smaller percentage of SNP were validated across years from the meta-analysis. GWAS using dosage were more powerful than hardcoded genotypes, where power was judged by the validation rates and FDR. Combining locations with multi-trait GWAS model increased validation rates substantially over combining single-trait GWAS results with a meta-analysis. The blackleg GWAS meta-analysis revealed 79 major genome regions (containing SNP P < 10−4) putatively involved in resistance to L. maculans. The B. napus genes in the vicinity of these significant SNP were mapped with BLASTn against the A. thaliana genome, which revealed several genes with putative functions related to disease resistance.

Factors affecting the power of GWAS

The imputation of WGS into diverse reference population lines enabled an increase in the sample size for the GWAS. We have shown that the accuracy of imputation was > 0.71. This level of accuracy is similar to what has been previously achieved in other crops54. However, it is substantially less than in human and other mammals56,57,58. Aside from the differences in the size of the WGS reference set, there are several possible reasons for this discrepancy. Mammals are diploid and their reference genome assemblies (particularly human) tend to have fewer errors than crop assemblies. Many crops are polyploid, often with significant homology between the sub-genomes in allopolyploids (e.g. canola) and this brings challenges to infer genotype dosage in autoploids. This adds complexity to inferring assemblies and to mapping reads to unique genome locations, which, in turn, leads to mapping errors and false positive SNPs. We have performed stringent quality control in our dataset in an attempt to reduce the number of erroneous SNPs and genotypes. Nevertheless, a large proportion of our imputed SNPs achieved only a low Minimac Rsq, which seems to indicate a reasonably large false positive SNP or genotyping error rate. Imputation accuracy could possibly be further improved with an improved B. napus genome assembly, a denser GBSt dataset coupled with greater fold coverage, a greater focus on sub-genome specific SNP, a larger WGS reference set, and more stringent quality control.

We have mitigated the effect of SNP and genotype uncertainty in two ways. First, we have imposed a Minimac Rsqr threshold of 0.1, which corresponds to an approximate empirical imputation accuracy of 0.4. Using only SNP exceeding this threshold is helpful in reducing the false positive rate and reducing the multiple testing problem by only interrogating better imputed SNPs. Secondly, we have investigated using both hardcoded (0,1,2) and dosage genotypes in our GWAS analysis to evaluate its effect on power. Dosage imputed WGS genotypes were found to be better at pinpointing causal mutations in dairy cattle57. Indeed, we found that dosage genotypes detected more significant SNPs with a lower FDR and with higher validation rates than hardcoded genotypes indicating that they improved the GWAS power. The improved power is thought to come from the modelling of uncertainty in the imputed genotype. If there is a lot of uncertainty the genotype will be closer to the heterozygote genotype, thereby reducing their influence in the analysis and potentially leading to fewer false positive associations.

The largest increase in GWAS power was observed from using WGS imputed sequences rather than sequences obtained by GBSt alone. This phenomenon has been demonstrated in a number of species, including A. thaliana59; rice60 and in crop species in general, see a review by Edwards and Batley61. In our study, imputation to WGS accomplished two goals. Firstly, it increased the number of SNPs and therefore increased the linkage disequilibrium between markers to increase their association with causal loci. Secondly, WGS imputation augmented the dataset with intergenic markers, which can harbor regulatory variants. The increase in power has led to additional discoveries of regions involved in blackleg disease resistance. This initial resource of WGS individuals can be expanded and improved to make future GWAS more effective. Additionally, it remains to be tested whether WGS will improve genomic prediction.

The majority of GWAS are fitted as single traits and environments. We have combined single trait GWAS using a meta-analysis approach42. An increase in power was observed previously from meta-analyses, whose aim is to discover loci that affect multiple traits42,62,63,64. Similarly, the advantages of large-scale meta-analysis for genetic mapping in plant were reviewed in several studies59,65,66 and suggested that meta-analysis provides prospects for the identification of genes relevant to trait improvement. In our case combining internal infection and survival from the 2015 trials in a meta-analysis substantially increased the number of significant SNPs and reduced the FDR. Validating the meta-analysis SNPs in the internal infection GWAS collected from the 2016 and 2017 trials revealed a higher validation rate for higher p-value levels or a lower validation rate for lower p-values compared to single trait GWAS (Fig. 5, Supplementary Fig. S6).

Blackleg fungal populations are highly dynamic and multiple populations exist in the Victorian environment31. We attempted to mitigate this by growing the two blackleg disease nurseries in 2015 on stubble from the same cultivar. Nevertheless, disease pressure could have varied across the two 2015 sites. Meta-analysis may be less effective than single trait GWAS for highly significant, possibly fungal population specific, loci, while at the same time it could be more effective for less significant loci that may be components of quantitative resistance. The environment of the 2016 and 2017 trials used for validation was clearly managed differently (i.e. treatments with fungicide) to the 2015 trials. Overall, this, along with the smaller samples size of later trials, would have depressed the validation rates. However, as all GWAS strategies were assessed on the same datasets the comparisons are expected to be fair. If the original phenotype and genotype data are available, it is preferable to fit an appropriate within trait full model with a fixed effect of location to combine data rather than meta-analysis. In our case, the full model was more successful at validation as p values became more stringent, indicating a lower rate of false positives. Future work could investigate a multi-trait GWAS to a multi-trait meta-analysis of single trait GWAS.

There are additional ways to validate SNPs identified in discovery sets, for example using partially known lines from the previous year datasets, as was shown in maize66 and rice67 , and using different lines from independent nursery, as shown in bread wheat68. Alternatively, a cross-validation approach could be used for validation40,69. Furthermore linkage mapping approaches31,70 in bi-parental populations have been used to validate significant markers identified in GWAS hits in canola and wheat.

Putative candidate regions and genes for resistance to L. maculans

Previous studies have identified major genes and genomic regions conferring resistance to L. maculans using primarily bi-parental populations and found many such regions are clustered on chromosome A07 and A10, see two reviews71,72. In addition, single-trait GWAS have, to date, identified > 334 putative loci for blackleg (L. maculans) disease resistance31,38,73.

We have arbitrarily defined significant genomic regions as separate if the distance between SNP was more than 200 kb. The size of region capturing potentially the same causative mutation is directly related to the amount of linkage disequilibrium (LD) in our population. Malmberg et al.74 determined in the same population that LD dropped to 0.2 at approximately 200 kb distance between SNPs, a level which has been confirmed in other studies25,75. This resulted in a total of 79 potential genomic regions across the A and C sub-genomes, of which 52 were new. While the remaining 27 of these putative regions were near previously published Rlm genes and quantitative QTL, further follow-up is required to confirm that our signals are indeed for known Rlm genes. Without a physical location for many published QTL it is often difficult to confirm whether our significant SNPs overlapped with previous linkage study results. When comparing cM from linkage groups and our physical positions our estimates of overlap were only approximations.

Four regions were identified on A01 of which one is near the Rlm12 region and potentially other QTL23,24,25,31. On A02, all four regions overlapped with previous results and included the LepR1 resistance gene21,24,25. Not surprisingly, these regions included many of our most significant SNP, hence other studies have also found them. Chromosome A03 harboured eight potential regions of which only one had been previously identified24,25. Several of the new regions contained only one or two SNP, that meet the significance threshold, indicating that the signal in for these regions was less strong. Seven regions were found on A04 with one previously published24,25. One region on A05 confirmed a previous report25 and four were new. Our GWAS detected LepR4 and other previously published QTL24,25,76 as well as two new regions on A06. All seven regions on A07 overlapped with published resistance genes (Rlm1, Rlm3, Rlm4, Rlm9, LmFr1, LMR1 and LEM1) as well as quantitative QTL11,12,13,14,15,16,17,18,19,21,23. Five regions were found on A08, of which one overlapped with Rlm5 and several QTL23,24,25,77, and four regions were new. One region of A09 overlapped with several QTL24,25 and three were novel. Two significant regions on A10 collocated with Rlm2, LepR3, BLMR1, BLMR2, and LepR216,49,76. Our observed signals in these two regions are likely associated to quantitative resistance rather than known major resistance genes, because Rlm2/LepR3 and LepR2 are defeated genes in the Australian context.

A substantial number of regions potentially harbouring blackleg disease resistance were found on the C sub-genome. Chromosome C01 contained one new and one previously known region23. No significant SNP were found on C02. Two and seven new regions were detected on C03 and C04, respectively, where one region on C04 was published25. C05 harboured one new region and nine new regions were found on C06 (one new region contained 80 significant SNP), of which one was previously discovered23. Rlm6, which was introgressed from B. juncea and is known to be present in Australian germplasm, was confirmed on C07 along with three new regions. Two regions were discovered on C0824,25, where one was new. Finally, C09 contained a new putative blackleg resistance region.

Assembly annotation is a crucial component to understanding and interpreting GWAS results. Many of our GWAS regions fell into the intergenic class, this is similar to findings in other species including mammals. It has also been observed in plants and algae78. Intergenic variants may have potential roles in transcriptional or translational regulation. The annotation of regulatory regions is complex and requires techniques that provide information outside of genic regions. Some examples of such assays are MNase, ChIP sequencing and HiC that provide information on open and closed chromatin regions and DNA folding11,12,13,14,15,16,17,18,19,38.

Due to the significant homology of genes, information can be shared across species. We further compared genomic regions detected in A. thaliana with previously known L. maculans resistant canola genes and found that a gene near one of our significant SNP and overlapping with Rlm249 had high homology with the AT1G01170.2 gene localized in A. thaliana Chr1-74105–74443, which is thought to have a role in stress responsive conditions. As Rlm2 is thought to be defeated in Australia, we speculate that our signal captured other resistance loci in the same region. Similarly, a gene near the locus for the B. juncea-introgressed Rlm577 blackleg resistant gene aligned to the AT1G10090.1 gene in Chr1 of A. thaliana. This gene is thought to be involved in stress responsive dehydration. The Victorian pre-breeding lines in our dataset do contain B. juncea introgressions. Moreover, we showed that candidate resistance genes were mapped in the vicinity of known R genes in B. napus genome and A. thaliana genes responsible for various stressors. This has demonstrated the importance of additional genomic resources in Brassica species such as TAIR (https://www.arabidopsis.org/) and Genoscope (https://www.genoscope.cns.fr/brassicanapus/) to enhance the discovery of new genes using related model plants. Several studies in other species confirm the benefits of BLAST to align putative candidate genes detected in GWAS analysis in wild wheat, tree species79 and canola31; B. juncea80; B. oleracea81. Overall, understanding the nature of genetic inheritance of a trait and genomic regions underlying association with blackleg traits will provide a new insight into the genetic architecture of the trait and accelerates the development of resistant cultivars.

Material and methods

Plant materials and trait measurements

We used data from 585 canola lines (391 winter and 194 spring types) grown at Wickliffe (37.665839° S, 142.754126° E) and Green Lake (36.768420° S, 142.264679° E), Victoria, Australia during 2015 growing season, previously described in Fikere et al.32 (Supplementary Table S11). Similarly, we conducted additional agronomic field trials under irrigated (Horsham irrigated 2016 and 2017) and rain-fed (Mininera 2016 and Horsham rain-fed 2017) conditions during the 2016 and 2017 growing seasons. Data from these agronomic trials (seedling emergence and internal infection) were used for the validation step of this study. The 2015 trial was a single-row disease nursery for blackleg sown in the canola stubble retained from the previous year’s crop (ATR-Gem). The trial was designed in AGROBASE using a randomized complete block design with a check variety grid every 10th row (var. Trigold). Two replicates were used at each location. The agronomic trials conducted in 2016 and 2017 were randomized incomplete block designs with 2–3 reps in each location, sown in 7.5 m2 plots. The following phenotypes were recorded from the disease nursery: emergence count (number of plants emerged 6 weeks after sowing), blackleg traits such as adult plant survival rate and average internal infection of the stem at maturity. The agronomic trials were protected from blackleg disease twice, by using fungicide at sowing (impact-in-furrow) and a treatment at the 6–8 leaf stage with Prosaro 420 SC foliar fungicide. Thus, the blackleg management protocol was different in years 2016 and 2017, and these trials were only used for validation purposes. To determine the internal infection of the stem, we employed a modified method of the “Blackleg Canker Test” used in the Australian National Variety Trials (Grain Research Development Corporation, personal communications on NVT “The Protocols”, version 4, 9th July 2014). Briefly, in the 2015 trial, a total of 20 plants per row were sampled from a minimum of 7 randomly chosen positions (i.e. 3 consecutive plants from 7 widely spaced positions within each row, including any dead or lodged plants). For the 2016 and 2017 trials, 5 consecutive plants in a middle row in each plot were sampled to generally survey for any level of blackleg in these fungicide protected trials. The sampled plants were then cut with secateurs at the crown and the cross section of the stem was examined. The area of the stem infected by L. maculans was recorded as a percentage of affected area of the total stem cross-section, as described in the Spring Blackleg Management Guide82. Plot values were used in analyses and, for average internal infection, this was the mean value of multiple stems in each plot.

Phenotype processing

Phenotypes were spatially adjusted with autocorrelation error (i.e. AR1 x AR1) models for field condition variability to generate Best Linear Unbiased Estimates (BLUEs) using ASReml83, as described in32,84. The model was as follows:

$$y_{ijk} = \mu + g_{i} + r_{j} + c_{k} + e_{ijk}$$

where \(y_{ijk}\) is the phenotype, \(\mu\) is the overall mean, \(g_{i}\) is the fixed effect for variety \(i\), \(r_{j}\) is the random effect for row \(j\), \(c_{k}\) is the random effect for column \(k\) fit as a spline, and \(e_{ijk}\) is the residual. The variance due to lines (Vg) was estimated using the above model with \(g_{i}\) fitted as random and broad-sense heritabilities were calculated as Vg/(Vg + Ve/nrep), where Ve was the residual variance and nrep was the number of replicates per line (nrep = 2 in all trials)85.

Transcriptome genotyping-by-sequencing (GBSt)

A total of 585 Spring and Winter type canola lines were genotyped using the protocol described in Malmberg et al.86. Briefly, mRNA was extracted from leaf tissue (Dynabeads: Life Technologies) and used for library preparation for RNA sequencing (SureSelect: Agilent Technologies). Circa 3 million reads were generated per sample using either an Illumina HiSeq3000 or a NextSeq500. The resulting sequence data was adaptor and quality trimmed before aligning to the Darmor-bzh whole genome reference87 using the Tophat2 algorithm88 and SNP were identified with SAMtools mpileup89.

Whole genome sequencing (WGS)

Whole genome sequence data from 153 samples covering the diversity of the 585 GBSt lines described in74 were re-analyzed. The sequencing protocols were fully described in86. Briefly, whole genome libraries were generated for all samples and sequenced on an Illumina HiSeq3000 aiming for 10 × read depth coverage per sample. The resulting sequence data was adaptor and quality trimmed before aligning to the Darmor-bzh whole genome reference87 using the BWA mem algorithm90, and SNP were identified with SAMtools mpileup89.

Quality control

We implemented stringent quality control thresholds on 6 million WGS variants based on minor allele frequency (MAF > 0.1), missing rate per SNPs and missing rate per sample (< 50%), genotype read depth (DP > 5), heterozygosity rate per SNP (Hetero > 0.4) followed by imputation accuracy from Minimac R-square (Rsqr > 0.1). Finally, genotypes were checked for SNP duplications and any duplicates were removed keeping one of the set in the refined data. The final dataset was checked for their MAF > 0.01, giving a total of 153 lines and 1,234,708 million SNPs for subsequent analysis.

Sequence imputation

We assessed the imputation accuracy of two imputation pipelines: FImpute (no pedigree option48 and Eagle V2.349 followed by Minimac3 V2.0.150 in the set of 153 whole-genome sequenced lines via fivefold cross validation. Locations not present in the GBSt assay were masked in validation lines and imputed from the WGS data. The accuracy of imputation was assessed as the correlation and concordance of imputed and observed sequence genotypes. Additionally, we investigated whether it is beneficial to combine winter and spring lines for imputation or consider them separately. To do so, imputation was conducted within spring lines, within winter lines separately and using spring lines as a validation and winter lines as reference set and vice versa in a tenfold cross validation. After this evaluation, the Eagle2.3 Minimac3 pipeline was used for imputing the 585 lines from GBSt density to WGS using the entire set of 153 lines as the reference set. Minimac3 provides genotypes in full dosage format (i.e. real numbers ranging from 0 to 2) and in hardcoded genotype format (i.e. coded as 0, 1, and 2, for homozygous reference, heterozygous, and homozygous alternative respectively) for the imputed sequence variants. We investigated the effect of these formats on the power of GWAS.

Genome-wide association analysis

We used EMMAX (Effective Mixed Model Association eXpedited) to perform GWAS91 analysing one SNP at a time with a linear mixed model. EMMAX makes the simplifying and time-saving assumption that any given SNP’s effect on the trait is typically small and therefore only estimates the model variance components once per analysis to account for population structure. EMMAX estimates the variance components with the REML model

$${\mathbf{y}} = {\mathbf{Wb}} + {\mathbf{Zg}} + {\mathbf{e}}$$

where y was a vector BLUEs, W and Z were incidence matrices, b was a vector of fixed effects including intercept and seasonal type (winter or spring), g (\({\mathbf{g}} \sim N\user2{ }\left( {0, \sigma_{g}^{2} {\mathbf{G}}} \right))\) and e (\({\mathbf{e}}\user2{ }\sim \user2{ }N\left( {0,{ }\sigma_{e}^{2} {\mathbf{I}}} \right))\) were vectors of additive genetic effects and residuals, respectively, and \(\sigma_{g}^{2}\) was the additive genetic and \(\sigma_{e}^{2}\) was the residual variance. G was the genomic relationship matrix calculated following Yang et al.55. An F-test is then calculated per SNP using the estimates of the polygenic and residual terms from the variance component step. GWAS were run per site and year. An additional model of combined BLUEs per trait and location was fitted as a fixed effect was run to investigate the effect of combining sites within year.

Meta-analysis of single-trait GWAS

The meta-analysis used the SNP effects from single-trait GWAS, as described in Bolormaa et al.42. A \(\chi^{2}\) test statistic was calculated as follows:

$$\chi^{2} = t_{i}^{^{\prime}} V^{ - 1} t_{i}$$

where \(t_{i}\) was number of traits k × 1 vector of the signed t-values of SNPi effects, i.e., beta/se, for the k traits; ti′ was the transpose ti (1 × k); and V−1 was the inverse of the k × k correlation matrix, where the correlations were calculated for all signed t-values per pair of traits. The \(\chi^{2}\) value of each SNP were examined for significance based on a \(\chi^{2}\) distribution with k degrees of freedom. False discovery rates were calculated as \(FDR = \frac{P * T}{A} \times 100,\) where P is the p-value tested, T is the total number of SNP tested and A is the number of SNP that were significant at the p-value tested. In addition, meta-analyses were used to combine single-trait GWAS across the two locations and to combine the traits survival rate and internal infection from the 2015 trials.

Validation of GWAS in two additional years of field trial data

GWAS were analysed in several different ways. Our aim was to validate which method resulted in the most power. A priori it is difficult to determine which method is more powerful. One way to judge the utility of an approach, aside from FDR, is how many of the significant SNP remain significant in a new dataset. In this study, we used emergence and internal infection values from the four agronomical field trials in two additional seasons (2016 and 2017) for validation. We performed the following validation strategies (VS): (VS1) single-GBSt-to-single-GBSt: single-trait GWAS 2015 in dosage GBSt in discovery set and single-trait GWAS in GBSt dosage validation in 2016 and 2017; (VS2) single-012-to-single-012: WGS single-trait GWAS 2015 in 012 (integer) in a discovery set and WGS single-trait GWAS in 012 (integer) validation in 2016 and 2017; (VS3) single-to-single: WGS single-trait dosage GWAS 2015 in discovery set and WGS single-trait dosage GWAS validation in 2016 and 2017; (VS4) meta-to-single: WGS meta-analysis GWAS of blackleg traits in 2015 and validation in WGS single trait GWAS in 2016 and 2017; (VS5) CombModel-to-single: WGS combined model sites per traits GWAS in 2015 and validation in WGS single trait GWAS in 2016 and 2017; (VS6) meta-AvInt-to-singleAvInt: WGS meta-analysis for AvInt GWAS in 2015 and validation in WGS single-trait AvInt GWAS in 2016 and 2017. Significance thresholds were chosen based on false discovery rates.

Performing BLAST for B. napus L. GWAS significant SNP against A. thaliana

In addition, to gain an understanding of the underlying gene functions, we used BLASTn (Basic Local Alignment Search Tool) analysis against the Arabdophsis thaliana sequence database (https://www.arabidopsis.org/). Gene sequences within 30 kb up or downstream of significant SNPs detected in meta-analysis at the 10−5 p-value were included in the BLASTn analysis. BLAST matches to multiple loci with the same top identity metrics were removed. Prediction of functional variant annotation of an individual SNPs was performed using SnpEff92 using the B. napus Darmor-bzh genome annotation file v5 (gff3) and the whole genome reference sequence87. Details of the annotation procedures is provided in Malmberg et al.74. Finally, potential genes from the meta-analysis were used to define the regions of interest that contain potential candidate genes.