Meta-analysis of GWAS in canola blackleg (Leptosphaeria maculans) disease traits demonstrates increased power from imputed whole-genome sequence

Blackleg disease causes yield losses in canola (Brassica napus L.). To identify resistance genes and genomic regions, genome-wide association studies (GWAS) of 585 diverse winter and spring canola accessions were performed using imputed whole-genome sequence (WGS) and transcriptome genotype-by-sequencing (GBSt). Blackleg disease phenotypes were collected across three years in six trials. GWAS were performed in several ways and their respective power was judged by the number of significant single nucleotide polymorphisms (SNP), the false discovery rate (FDR), and the percentage of SNP that validated in additional field trials in two subsequent years. WGS GWAS with 1,234,708 million SNP detected a larger number of significant SNP, achieved a lower FDR and a higher validation rate than GBSt with 64,072 SNP. A meta-analysis combining survival and average internal infection resulted in lower FDR but also lower validation rates. The meta-analysis GWAS identified 79 genomic regions (674 SNP) conferring potential resistance to L. maculans. While several GWAS signals localised in regions of known Rlm genes, fifty-three new potential resistance regions were detected. Seventeen regions had underlying genes with putative functions related to disease defence or stress response in Arabidopsis thaliana. This study provides insight into the genetic architecture and potential molecular mechanisms underlying canola L. maculans resistance.

www.nature.com/scientificreports/ Canola (also known as rapeseed, Brassica napus L., genome AACC, 2n = 38) is one of the most important sources of edible oil providing 14% of global production 1 . Strong canola growth demand requires critical increases in overall production and productivity, which can in part be achieved through overcoming biotic and abiotic stress. One of the major biotic stressors is blackleg disease, which makes it a high priority trait for canola breeding programs. It is caused by the fungal pathogen Lepthosphaeria maculans resulting in large yield losses, thereby becoming a major threat to world canola production. Most canola growing regions experience losses due to blackleg disease, with Canada and France having reported between 5 and 40% average yield losses annually 2,3 . Similar yield reduction is common in Australia, however, in some regions losses as severe as 90% have been observed [4][5][6] . Genetic resistance to blackleg can be either qualitative (race specific) or quantitative (non-race specific) and commercial cultivars tend to have qualitative resistance genes that, when effective, confer complete immunity against blackleg. However, quantitative resistance has the potential to only restrict the development of L. maculans formation throughout plant development stages [7][8][9][10] .
Elucidating the genetic basis of blackleg resistance as well as identification and localization of resistance genes/ genomic regions to L. maculans is important to understand the genetic architecture of resistance to the pathogen and is an important step to mitigate its catastrophic effect. Early studies focused on bi-parental mapping of quantitative trait loci (QTL) using RFLP, SSR, AFLP, RAPD or DArT markers for blackleg disease resistance [11][12][13][14][15][16][17][18][19][20][21][22][23] . In addition, several environmentally stable QTL were identified and localized in A and C sub-genomes conferring resistance to L. maculans [24][25][26] . Moreover, assembling pangenomes has been applied to discover novel genes in various species 27,28 . A study by Hurgobin et al. 29 pointed out that one of the causes for gene absence or presence in B. napus was homoeologous exchange and it contributed towards resistance gene diversity.
In the past decade, SNPs have become the genetic marker of choice over multi-allelic markers due to their abundance in the genome and high polymorphism rate 30 . Raman et al. 31 used SNP-chip based linkage maps and found a major L. maculans resistance locus on chromosome A07 that accounted for up to 69.2% of genetic variation and was mapped around the major locus Rlm4. While the genetic variance explained with bi-parental populations is often estimated to be high, these loci explain much less of the variation generally observed in more diverse germplasm. Fikere et al. 32 demonstrated that a large proportion of genetic variation for blackleg disease resistance in diverse canola population remains unexplained by currently published QTL studies. Nevertheless, bi-parental QTL mapping experiments were an important first step towards increasing the understanding of blackleg disease architecture in canola.
The development of advanced sequencing and SNP chip technologies, and the associated drop in genotyping prices, has enabled the use of thousands of SNP markers across the genome. The increased SNP density allows for the application of single SNP regressions or genome-wide association studies (GWAS) in more diverse germplasm. Genome-wide association studies have rapidly increased the discovery of new putative candidate genes for complex traits in a variety of crops such as rice 33 , maize 34 , soybean 35 , and flax 36,37 . In canola, several GWAS have been reported for agronomic and quality traits. Few GWAS have been reported on major canola diseases (Leptosphaeria maculans and Sclerotinia stem rot) 22,38,39 . Most recently, Raman et al. 31 used a panel of 179 diverse canola lines genotyped with 18,804 SNPs to perform a GWAS that discovered a new L. maculans resistance gene Rlm12 on chromosome A01 as well as additional QTL.
We investigated two main ways to increase the power of GWAS using (1) meta-analysis of single trait GWAS to identify pleiotropic loci and (2) imputed whole-genome sequence (WGS) data. A meta-analysis GWAS combines single trait results from different traits or different environments 40,41 . This would increase the power to uncover pleiotropic loci affecting multiple traits. The increased power has been demonstrated in several studies in human and the meta-analyses have been applied widely in human and animal genetics as they avoid sharing of original genotypes and phenotypes [40][41][42][43][44] .
The use of WGS variants substantially increases the number of markers and thus increases linkage disequilibrium (LD) between markers and causative mutations. Furthermore, the causative mutations may themselves be genotyped directly. This is proposed to lead to more effective pinpointing of genes or genomic regions linked with complex traits of interest. However, WGS for a large number of individuals is expensive. A more costeffective approach is to impute WGS into individuals that have been genotyped with a lower density assay. A number of imputation software programs are available, such as MACH 45 , IMPUTE 46 , BEAGLE 47 , FImpute 48 , and Eagle 49 + Minimac 50 , and they vary in computational efficiency and imputation accuracies 51,52 . Imputation has been implemented in several crops, a study by Torkamaneh et al. 53 achieved high accuracy when imputing from low-coverage genotyping-by-sequencing (GBS) to WGS in soybean. Similarly, Shi et al. 54 examined the potential of exome sequence imputation in wheat. These studies emphasized the need to carefully evaluate imputation algorithms in new datasets as large differences in performance exist.
Here we report on an imputation analysis from transcriptome GBSt to WGS and GWAS in a relatively large population of spring and winter canola phenotyped across three growing seasons. We impute WGS into 585 canola lines genotyped with GBSt and analyse this dataset with single-trait GWAS with the aim of: (i) comparing the power of GWAS using GBSt and imputed WGS and (ii) performing meta-analysis of GWAS for two blackleg resistance traits (survival and internal infection) and identify genomic regions potentially harbouring new resistance genes associated with blackleg traits as well as compare its power to single-trait GWAS.

Results
Phenotypic variation and correlation coefficient. In-field phenotypes for 585 diverse canola lines in field trials from 2015 to 2017 were adjusted for spatial variation and best linear unbiased estimates (BLUEs) were calculated per trial and per line ( Table 1). Performance of emergence and internal infection varied across the 2015-2017 growing seasons. For instance, in the 2015 trial, seedling emergence counts were substantially lower at Green Lake than Wickliffe, possibly due to drier conditions at Green Lake. Blackleg traits (survival rate and  Fig. S1. Moderately positive correlation coefficients were observed between similar traits at different sites and years. A strong negative correlation coefficient was recorded between average internal infection and survival rate (r = − 0.86), following the expected trend of high infection being associated with low survival ( Supplementary Fig. S1).
Genomic data and imputation accuracy. We evaluated the imputation accuracy (correlation of observed and imputed genotypes) and concordance rate of WGS in 153 spring and winter type canola lines (Supplementary Table S1a). The low-density transcriptome genotyping-by-sequencing (GBSt) of these 153 lines consisted of 64,072 SNPs that overlapped with the 6 million WGS variants. Testing included FImpute and Eagle + Minimac. The correlation between observed and imputed genotypes was considerably higher in Eagle + Minimac3 compared to FImpute in all the sub-population validation scenarios (Supplementary Table S1a). For instance, the correlations for Eagle + Minimac3 imputed spring lines were 0.78 when both spring + winter and spring only reference sets were used, whereas with FImpute they were 0.68 and 0.71, respectively. The accuracy of the imputation followed the same trend, being higher for Eagle + Minimac3, compared to FImpute, with the exception of the imputed spring lines from the spring reference set (0.84 vs 0.87; Supplementary Table S1a). A decline in the accuracy of imputation was observed for low-frequency SNPs in both imputation algorithms. This decline was less pronounced when using Eagle + Minimac3 ( Supplementary Fig. S2). Based on these results we used Eagle + Minimac3 to impute all lines from 64,072 to 6 million SNPs.
A total of 585 individuals genotyped at 64,072 SNP were imputed to WGS using 153 individuals with 6 million SNPs as a reference set. Sporadic missing genotypes were imputed using Beagle 47 in both datasets followed by phasing with Eagle, which in turn was followed by the actual imputation using Minimac 50 . Based on the Minimac results, we set a Rsqr (squared correlation between imputed and posterior genotypes) threshold of > 0.1, which dropped the final SNP density to 1,234,708 (Supplementary Table S1b). Supplementary Table S1b also shows that other slightly altered pipelines resulted in lower Rsqr values. Imputation accuracy and the proportion of correctly imputed SNP markers (accuracy) are presented in Supplementary Table S1b. Beagle + Eagle + Minimac was chosen to impute whole-genome sequence for GWAS.
The genetic relatedness between 585 canola lines was investigated with a genomic relationship matrix as described in Yang et al. 55 . The population were clustered into two main categories (Fig. 1). The larger cluster defined the winter lines followed by the spring set. A few spring lines with winter background formed a separate cluster, confirming that these populations were originally derived from winter lines.
Single trait genome-wide association study and false discovery rate. Manhattan plots of singletrait GWAS using WGS and GBSt for emergence count, survival rate, and average internal infection at Wickliffe and Green Lake sites are shown in Supplementary Figs. S3abc, S4abc and S5 and QQ plots indicated that population structure was properly accounted for ( Supplementary Fig. S6). Using WGS, a considerably larger number of  www.nature.com/scientificreports/ significant SNPs than for the GBSt GWAS were detected at all P-value thresholds and for all traits ( Table 2). In general, similar peaks were identified for GBSt and WGS, but the WGS GWAS revealed additional peaks ( Fig. 2a  vs b). Furthermore, we repeated the WGS GWAS with hardcoded genotypes (most probable genotype coded as 0, 1, and 2), which generally identified a similar number of significant SNP at P ≤ 1 × 10 −4 and fewer significant SNP at P ≤ 1 × 10 −5 with higher FDR. The general trend of power based on FDR was, therefore, WGS dosage > WGS hardcoded > GBSt dosage (Fig. 3).  www.nature.com/scientificreports/   Tables 2, 3). Adding emergence GWAS into the meta-analysis increased the number of significant SNP, however, this trend stopped at P < 1 × 10 −6 ( Table 3). In addition, we compared a GWAS model that combined BLUEs from different sites to combining separate site GWAS using a meta-analysis ( Table 4). The meta-analyses detected more significant SNP at all p-value thresholds.
Validation and comparison of results from different GWAS strategies. We used the 2016 and 2017 datasets to validate and compare the different GWAS approaches. We implemented various validation strategies (VS) and quantified the most significant SNPs at P < 1 × 10 −5 identified in several disovery sets: single-trait GWAS analysis using GBSt (VS1); single-trait GWAS hardcoded (VS2); WGS GWAS within locations; single-trait dosage (VS3); multi-trait meta-analysis for blackleg traits using dosage WGS genotypes (VS4); full linear mixed model to combine locations per trait (VS5); meta-analysis combining 2 locations per trait (VS6); and ( Fig. 5; Supplementary Tables S2-S7). Validation strategy 1 (VS1) involved a single-trait GWAS-GBSt dosage discovery set during 2015 at Wickliffe and Green Lake sites with validation in single-trait GWAS-GBSt dosage in 2016 and 2017 trials for EME and AvInf. We found that a smaller number of significant SNPs validated in VS1 versus VS2, which validated the single-trait GWAS-WGS hardcoded discovery set in single-trait GWAS-hardcoded in the same datasets (Fig. 5). We further extended the validation strategy (VS2) to discover and validate significant SNPs detected   Table 3. Meta-analysis for specific to blackleg traits and combined all-traits for the 2015 trials. FDR = False discovery rate.

Meta-analysis for blackleg traits
Meta-analysis all-traits www.nature.com/scientificreports/ in single-trait GWAS-WGS dosage genotypes (VS3) and found VS3 to be superior in its validation rates at all p-value thresholds. Comparing VS2 and VS3 to VS1, we can conclude that imputed WGS GWAS were more powerful than GBSt GWAS in this dataset. A slightly different trend was observed when significant SNPs discovered in blackleg meta-analysis GWAS were validated using single-trait GWAS at 2016 and 2017 sites (VS4). Compared to VS1-3, we observed a higher validation rate for P < 10 −2 and P < 10 −4 in VS4, but a lower validation rate for P < 10 −3 and P < 10 −5 . This shows that meta-analysis improved the overall validation rate, but when specifically looking at highly significant SNP in both the discovery and validation GWAS, it validated a lower proportion of SNPs. Validation strategies 5 (VS5) and 6 (VS6) tested whether it is better to combine data per trait with a multi-location GWAS model (and fitting location as fixed) versus combining locations with a meta-analysis of single location GWAS, respectively. The validation rates observed clearly indicated that combining locations in a full model (VS5) is preferable to simply performing a meta-analysis of single location GWAS (VS6), which is to some extent expected. It is notable that for P < 10 −3 , the location full GWAS model (VS5) performed better than the blackleg meta-analysis (VS4). Validation rate (Vr) calculated as: Vr = Number of significant NPs acheived in the validation at P value Number of significant SNPs in the discovery set at P value Table 4. Meta-analysis using a combined best linear unbiased estimates (BLUEs) from the Wickliffe and Green Lake 2015 trials. EME = emergence count, surv = survival rate, AvInf = Average internal infection, FDR = false discovery rate.  www.nature.com/scientificreports/ Pinpointing candidate genomic regions identified for resistance to L. maculans. The number of significant SNPs (at P < 1 × 10 −4 ) in the single-trait GWAS analysis varied between blackleg traits and across experimental sites. In general, a higher number of significant SNPs were found at the Wickliffe site compared to Green Lake, while the number of discoveries for the traits within location was relatively similar.
The meta-analysis of blackleg GWAS discovered a larger number of additional SNPs when compared to the single trait GWAS. The most significant SNPs (P = 9.59E−07) were located on 3685 kb of chromosome A02 followed by ChrA08, ChrC03, and ChrC06 (Fig. 4, Supplementary Tables S8, S9 and S10). To identify potential candidate genes with putative resistance to L. maculans, the 674 significant SNPs obtained from meta-analysis were clustered into separate regions if the distance between SNP was more than 200 kb. A total of 79 potential regions were observed (Supplementary Tables S8 and S10). Furthermore, when we compared within the A and C sub-genomes, 51 genomic regions consisting of 344 significant SNPs were on the A sub-genome, while 330 SNP were within 28 regions on the C sub-genomes. The regions with the most significant SNPs were on chromosome C06 (157 SNP) followed by chromosome A02 (125 SNP) and A09 (64 SNP) and A03 (55 SNP). No significant SNP (P < 1 × 10 −5 ) were detected on C02, and very few significant SNP were observed on C01 and C05 (Supplementary Table S8, Fig. 4). Overall, many of the significant results detected in our study shared similar regions with known Rlm genes and QTLs found in linkage mapping populations.
Candidate gene mining for blackleg disease traits using BLAST. B. napus SNP effect annotations revealed that most significant SNP were intergenic, followed by intronic, synonymous, non-synonymous, stop gained and lost coding functions (Supplementary Tables S9 and S10). We performed a BLASTn against Arabidopsis thaliana for B. napus gene sequences within 30 kb up and downstream of the 674 significantly associated SNP P < 1 × 10 −5 from the blackleg meta-analysis. The BLAST cut-off e-value threshold was set as e−7. The percentage of identical matches, number of nucleic mismatches, expected e-value and bit score were estimated. We searched the BLASTn results with the following search terms; such as disease, resistance, fungus, pathogen, pathogenesis, stem canker, stress, bacteria and abnormality. Based on this we found potential candidate genes with known biotic and abiotic resistant QTLs associated with A. thaliana accounting for 17 out of the 79 genomic regions, with functions that positively related to disease resistance and plant stress.

Discussion
We have reported GWAS results for seedling emergence and blackleg traits (internal infection and survival) and we have validated and compared the results from different GWAS strategies across multiple locations and years. We conducted single-trait and multi-trait meta-analysis GWAS using WGS and GBSt in a diverse set of canola lines. The imputation to WGS was reasonably accurate and the imputation pipeline using Eagle + Minimac3 seemed to outperform FImpute. WGS-based GWAS were more powerful than using GBSt as indicated by the number of significant SNPs, the FDR, and the validation rates in additional years of field trial data. Meta-analysis identified more significant SNP than single trait GWAS with a similar FDR, but a smaller percentage of SNP were validated across years from the meta-analysis. GWAS using dosage were more powerful than hardcoded genotypes, where power was judged by the validation rates and FDR. Combining locations with multi-trait GWAS model increased validation rates substantially over combining single-trait GWAS results with a meta-analysis. The blackleg GWAS meta-analysis revealed 79 major genome regions (containing SNP P < 10 −4 ) putatively involved in resistance to L. maculans. The B. napus genes in the vicinity of these significant SNP were mapped with BLASTn against the A. thaliana genome, which revealed several genes with putative functions related to disease resistance.

Factors affecting the power of GWAS. The imputation of WGS into diverse reference population lines
enabled an increase in the sample size for the GWAS. We have shown that the accuracy of imputation was > 0.71. This level of accuracy is similar to what has been previously achieved in other crops 54 . However, it is substantially less than in human and other mammals [56][57][58] . Aside from the differences in the size of the WGS reference set, there are several possible reasons for this discrepancy. Mammals are diploid and their reference genome assemblies (particularly human) tend to have fewer errors than crop assemblies. Many crops are polyploid, often with significant homology between the sub-genomes in allopolyploids (e.g. canola) and this brings challenges to infer genotype dosage in autoploids. This adds complexity to inferring assemblies and to mapping reads to unique genome locations, which, in turn, leads to mapping errors and false positive SNPs. We have performed stringent quality control in our dataset in an attempt to reduce the number of erroneous SNPs and genotypes. Nevertheless, a large proportion of our imputed SNPs achieved only a low Minimac Rsq, which seems to indicate a reasonably large false positive SNP or genotyping error rate. Imputation accuracy could possibly be further improved with an improved B. napus genome assembly, a denser GBSt dataset coupled with greater fold coverage, a greater focus on sub-genome specific SNP, a larger WGS reference set, and more stringent quality control.
We have mitigated the effect of SNP and genotype uncertainty in two ways. First, we have imposed a Minimac Rsqr threshold of 0.1, which corresponds to an approximate empirical imputation accuracy of 0.4. Using only SNP exceeding this threshold is helpful in reducing the false positive rate and reducing the multiple testing problem by only interrogating better imputed SNPs. Secondly, we have investigated using both hardcoded (0,1,2) and dosage genotypes in our GWAS analysis to evaluate its effect on power. Dosage imputed WGS genotypes were found to be better at pinpointing causal mutations in dairy cattle 57 . Indeed, we found that dosage genotypes detected more significant SNPs with a lower FDR and with higher validation rates than hardcoded genotypes indicating that they improved the GWAS power. The improved power is thought to come from the modelling of uncertainty in the imputed genotype. If there is a lot of uncertainty the genotype will be closer to the heterozygote genotype, thereby reducing their influence in the analysis and potentially leading to fewer false positive associations. www.nature.com/scientificreports/ The largest increase in GWAS power was observed from using WGS imputed sequences rather than sequences obtained by GBSt alone. This phenomenon has been demonstrated in a number of species, including A. thaliana 59 ; rice 60 and in crop species in general, see a review by Edwards and Batley 61 . In our study, imputation to WGS accomplished two goals. Firstly, it increased the number of SNPs and therefore increased the linkage disequilibrium between markers to increase their association with causal loci. Secondly, WGS imputation augmented the dataset with intergenic markers, which can harbor regulatory variants. The increase in power has led to additional discoveries of regions involved in blackleg disease resistance. This initial resource of WGS individuals can be expanded and improved to make future GWAS more effective. Additionally, it remains to be tested whether WGS will improve genomic prediction.
The majority of GWAS are fitted as single traits and environments. We have combined single trait GWAS using a meta-analysis approach 42 . An increase in power was observed previously from meta-analyses, whose aim is to discover loci that affect multiple traits 42,[62][63][64] . Similarly, the advantages of large-scale meta-analysis for genetic mapping in plant were reviewed in several studies 59,65,66 and suggested that meta-analysis provides prospects for the identification of genes relevant to trait improvement. In our case combining internal infection and survival from the 2015 trials in a meta-analysis substantially increased the number of significant SNPs and reduced the FDR. Validating the meta-analysis SNPs in the internal infection GWAS collected from the 2016 and 2017 trials revealed a higher validation rate for higher p-value levels or a lower validation rate for lower p-values compared to single trait GWAS (Fig. 5, Supplementary Fig. S6).
Blackleg fungal populations are highly dynamic and multiple populations exist in the Victorian environment 31 . We attempted to mitigate this by growing the two blackleg disease nurseries in 2015 on stubble from the same cultivar. Nevertheless, disease pressure could have varied across the two 2015 sites. Meta-analysis may be less effective than single trait GWAS for highly significant, possibly fungal population specific, loci, while at the same time it could be more effective for less significant loci that may be components of quantitative resistance. The environment of the 2016 and 2017 trials used for validation was clearly managed differently (i.e. treatments with fungicide) to the 2015 trials. Overall, this, along with the smaller samples size of later trials, would have depressed the validation rates. However, as all GWAS strategies were assessed on the same datasets the comparisons are expected to be fair. If the original phenotype and genotype data are available, it is preferable to fit an appropriate within trait full model with a fixed effect of location to combine data rather than meta-analysis. In our case, the full model was more successful at validation as p values became more stringent, indicating a lower rate of false positives. Future work could investigate a multi-trait GWAS to a multi-trait meta-analysis of single trait GWAS.
There are additional ways to validate SNPs identified in discovery sets, for example using partially known lines from the previous year datasets, as was shown in maize 66 and rice 67 , and using different lines from independent nursery, as shown in bread wheat 68 . Alternatively, a cross-validation approach could be used for validation 40,69 . Furthermore linkage mapping approaches 31,70 in bi-parental populations have been used to validate significant markers identified in GWAS hits in canola and wheat.
Putative candidate regions and genes for resistance to L. maculans. Previous studies have identified major genes and genomic regions conferring resistance to L. maculans using primarily bi-parental populations and found many such regions are clustered on chromosome A07 and A10, see two reviews 71,72 . In addition, single-trait GWAS have, to date, identified > 334 putative loci for blackleg (L. maculans) disease resistance 31,38,73 .
We have arbitrarily defined significant genomic regions as separate if the distance between SNP was more than 200 kb. The size of region capturing potentially the same causative mutation is directly related to the amount of linkage disequilibrium (LD) in our population. Malmberg et al. 74 determined in the same population that LD dropped to 0.2 at approximately 200 kb distance between SNPs, a level which has been confirmed in other studies 25,75 . This resulted in a total of 79 potential genomic regions across the A and C sub-genomes, of which 52 were new. While the remaining 27 of these putative regions were near previously published Rlm genes and quantitative QTL, further follow-up is required to confirm that our signals are indeed for known Rlm genes. Without a physical location for many published QTL it is often difficult to confirm whether our significant SNPs overlapped with previous linkage study results. When comparing cM from linkage groups and our physical positions our estimates of overlap were only approximations.
Four regions were identified on A01 of which one is near the Rlm12 region and potentially other QTL [23][24][25]31 . On A02, all four regions overlapped with previous results and included the LepR1 resistance gene 21,24,25 . Not surprisingly, these regions included many of our most significant SNP, hence other studies have also found them. Chromosome A03 harboured eight potential regions of which only one had been previously identified 24,25 . Several of the new regions contained only one or two SNP, that meet the significance threshold, indicating that the signal in for these regions was less strong. Seven regions were found on A04 with one previously published 24,25 . One region on A05 confirmed a previous report 25 and four were new. Our GWAS detected LepR4 and other previously published QTL 24,25,76 as well as two new regions on A06. All seven regions on A07 overlapped with published resistance genes (Rlm1, Rlm3, Rlm4, Rlm9, LmFr1, LMR1 and LEM1) as well as quantitative QTL [11][12][13][14][15][16][17][18][19]21,23 . Five regions were found on A08, of which one overlapped with Rlm5 and several QTL [23][24][25]77 , and four regions were new. One region of A09 overlapped with several QTL 24,25 and three were novel. Two significant regions on A10 collocated with Rlm2, LepR3, BLMR1, BLMR2, and LepR2 16,49,76 . Our observed signals in these two regions are likely associated to quantitative resistance rather than known major resistance genes, because Rlm2/LepR3 and LepR2 are defeated genes in the Australian context.
A substantial number of regions potentially harbouring blackleg disease resistance were found on the C subgenome. Chromosome C01 contained one new and one previously known region 23 . No significant SNP were found on C02. Two and seven new regions were detected on C03 and C04, respectively, where one region on C04 was published 25  www.nature.com/scientificreports/ contained 80 significant SNP), of which one was previously discovered 23 . Rlm6, which was introgressed from B. juncea and is known to be present in Australian germplasm, was confirmed on C07 along with three new regions. Two regions were discovered on C08 24,25 , where one was new. Finally, C09 contained a new putative blackleg resistance region. Assembly annotation is a crucial component to understanding and interpreting GWAS results. Many of our GWAS regions fell into the intergenic class, this is similar to findings in other species including mammals. It has also been observed in plants and algae 78 . Intergenic variants may have potential roles in transcriptional or translational regulation. The annotation of regulatory regions is complex and requires techniques that provide information outside of genic regions. Some examples of such assays are MNase, ChIP sequencing and HiC that provide information on open and closed chromatin regions and DNA folding [11][12][13][14][15][16][17][18][19]38 .
Due to the significant homology of genes, information can be shared across species. We further compared genomic regions detected in A. thaliana with previously known L. maculans resistant canola genes and found that a gene near one of our significant SNP and overlapping with Rlm2 49 had high homology with the AT1G01170.2 gene localized in A. thaliana Chr1-74105-74443, which is thought to have a role in stress responsive conditions. As Rlm2 is thought to be defeated in Australia, we speculate that our signal captured other resistance loci in the same region. Similarly, a gene near the locus for the B. juncea-introgressed Rlm5 77 blackleg resistant gene aligned to the AT1G10090.1 gene in Chr1 of A. thaliana. This gene is thought to be involved in stress responsive dehydration. The Victorian pre-breeding lines in our dataset do contain B. juncea introgressions. Moreover, we showed that candidate resistance genes were mapped in the vicinity of known R genes in B. napus genome and A. thaliana genes responsible for various stressors. This has demonstrated the importance of additional genomic resources in Brassica species such as TAIR (https ://www.arabi dopsi s.org/) and Genoscope (https ://www.genos cope.cns.fr/brass icana pus/) to enhance the discovery of new genes using related model plants. Several studies in other species confirm the benefits of BLAST to align putative candidate genes detected in GWAS analysis in wild wheat, tree species 79  phenotype processing. Phenotypes were spatially adjusted with autocorrelation error (i.e. AR1 x AR1) models for field condition variability to generate Best Linear Unbiased Estimates (BLUEs) using ASReml 83 , as described in 32,84 . The model was as follows: where y ijk is the phenotype, µ is the overall mean, g i is the fixed effect for variety i , r j is the random effect for row j , c k is the random effect for column k fit as a spline, and e ijk is the residual. The variance due to lines (Vg) was estimated using the above model with g i fitted as random and broad-sense heritabilities were calculated as Vg/(Vg + Ve/nrep), where Ve was the residual variance and nrep was the number of replicates per line (nrep = 2 in all trials) 85 .
Transcriptome genotyping-by-sequencing (GBSt). A total of 585 Spring and Winter type canola lines were genotyped using the protocol described in Malmberg et al. 86 . Briefly, mRNA was extracted from leaf tissue y ijk = µ + g i + r j + c k + e ijk Scientific RepoRtS | (2020) 10:14300 | https://doi.org/10.1038/s41598-020-71274-6 www.nature.com/scientificreports/ (Dynabeads: Life Technologies) and used for library preparation for RNA sequencing (SureSelect: Agilent Technologies). Circa 3 million reads were generated per sample using either an Illumina HiSeq3000 or a NextSeq500. The resulting sequence data was adaptor and quality trimmed before aligning to the Darmor-bzh whole genome reference 87 using the Tophat2 algorithm 88 and SNP were identified with SAMtools mpileup 89 .
Whole genome sequencing (WGS). Whole genome sequence data from 153 samples covering the diversity of the 585 GBSt lines described in 74 were re-analyzed. The sequencing protocols were fully described in 86 . Briefly, whole genome libraries were generated for all samples and sequenced on an Illumina HiSeq3000 aiming for 10 × read depth coverage per sample. The resulting sequence data was adaptor and quality trimmed before aligning to the Darmor-bzh whole genome reference 87 using the BWA mem algorithm 90 , and SNP were identified with SAMtools mpileup 89 .
Quality control. We implemented stringent quality control thresholds on 6 million WGS variants based on minor allele frequency (MAF > 0.1), missing rate per SNPs and missing rate per sample (< 50%), genotype read depth (DP > 5), heterozygosity rate per SNP (Hetero > 0.4) followed by imputation accuracy from Minimac R-square (Rsqr > 0.1). Finally, genotypes were checked for SNP duplications and any duplicates were removed keeping one of the set in the refined data. The final dataset was checked for their MAF > 0.01, giving a total of 153 lines and 1,234,708 million SNPs for subsequent analysis.
Sequence imputation. We assessed the imputation accuracy of two imputation pipelines: FImpute (no pedigree option 48 and Eagle V2.3 49 followed by Minimac3 V2.0.1 50 in the set of 153 whole-genome sequenced lines via fivefold cross validation. Locations not present in the GBSt assay were masked in validation lines and imputed from the WGS data. The accuracy of imputation was assessed as the correlation and concordance of imputed and observed sequence genotypes. Additionally, we investigated whether it is beneficial to combine winter and spring lines for imputation or consider them separately. To do so, imputation was conducted within spring lines, within winter lines separately and using spring lines as a validation and winter lines as reference set and vice versa in a tenfold cross validation. After this evaluation, the Eagle2.3 Minimac3 pipeline was used for imputing the 585 lines from GBSt density to WGS using the entire set of 153 lines as the reference set. Minimac3 provides genotypes in full dosage format (i.e. real numbers ranging from 0 to 2) and in hardcoded genotype format (i.e. coded as 0, 1, and 2, for homozygous reference, heterozygous, and homozygous alternative respectively) for the imputed sequence variants. We investigated the effect of these formats on the power of GWAS.

Genome-wide association analysis. We used EMMAX (Effective Mixed Model Association eXpedited)
to perform GWAS 91 analysing one SNP at a time with a linear mixed model. EMMAX makes the simplifying and time-saving assumption that any given SNP's effect on the trait is typically small and therefore only estimates the model variance components once per analysis to account for population structure. EMMAX estimates the variance components with the REML model where y was a vector BLUEs, W and Z were incidence matrices, b was a vector of fixed effects including intercept and seasonal type (winter or spring), g ( g ∼ N 0, σ 2 g G ) and e ( e ∼ N 0, σ 2 e I ) were vectors of additive genetic effects and residuals, respectively, and σ 2 g was the additive genetic and σ 2 e was the residual variance. G was the genomic relationship matrix calculated following Yang et al. 55 . An F-test is then calculated per SNP using the estimates of the polygenic and residual terms from the variance component step. GWAS were run per site and year. An additional model of combined BLUEs per trait and location was fitted as a fixed effect was run to investigate the effect of combining sites within year.

Meta-analysis of single-trait GWAS.
The meta-analysis used the SNP effects from single-trait GWAS, as described in Bolormaa et al. 42 . A χ 2 test statistic was calculated as follows: where t i was number of traits k × 1 vector of the signed t-values of SNP i effects, i.e., beta/se, for the k traits; t i ′ was the transpose t i (1 × k); and V −1 was the inverse of the k × k correlation matrix, where the correlations were calculated for all signed t-values per pair of traits. The χ 2 value of each SNP were examined for significance based on a χ 2 distribution with k degrees of freedom. False discovery rates were calculated as FDR = P * T A × 100, where P is the p-value tested, T is the total number of SNP tested and A is the number of SNP that were significant at the p-value tested. In addition, meta-analyses were used to combine single-trait GWAS across the two locations and to combine the traits survival rate and internal infection from the 2015 trials.
Validation of GWAS in two additional years of field trial data. GWAS were analysed in several different ways. Our aim was to validate which method resulted in the most power. A priori it is difficult to determine which method is more powerful. One way to judge the utility of an approach, aside from FDR, is how many of the significant SNP remain significant in a new dataset. In this study, we used emergence and internal infection values from the four agronomical field trials in two additional seasons Performing BLAST for B. napus L. GWAS significant SNP against A. thaliana. In addition, to gain an understanding of the underlying gene functions, we used BLASTn (Basic Local Alignment Search Tool) analysis against the Arabdophsis thaliana sequence database (https ://www.arabi dopsi s.org/). Gene sequences within 30 kb up or downstream of significant SNPs detected in meta-analysis at the 10 −5 p-value were included in the BLASTn analysis. BLAST matches to multiple loci with the same top identity metrics were removed. Prediction of functional variant annotation of an individual SNPs was performed using SnpEff 92 using the B. napus Darmor-bzh genome annotation file v5 (gff3) and the whole genome reference sequence 87 . Details of the annotation procedures is provided in Malmberg et al. 74 . Finally, potential genes from the meta-analysis were used to define the regions of interest that contain potential candidate genes.