Analysis of genome-wide variants through bulked segregant RNA sequencing reveals a major gene for resistance to Plasmodiophora brassicae in Brassica oleracea

Two cabbage (Brassica oleracea) cultivars ‘Tekila’ and ‘Kilaherb’ were identified as resistant to several pathotypes of Plasmodiophora brassicae. In this study, we identified a clubroot resistance gene (Rcr7) in ‘Tekila’ for resistance to pathotype 3 of P. brassicae from a segregating population derived from ‘Tekila’ crossed with the susceptible line T010000DH3. Genetic mapping was performed by identifying the percentage of polymorphic variants (PPV), a new method proposed in this study, through bulked segregant RNA sequencing. Chromosome C7 carried the highest PPV (42%) compared to the 30–34% in the remaining chromosomes. A peak with PPV (56–73%) was found within the physical interval 41–44 Mb, which indicated that Rcr7 might be located in this region. Kompetitive Allele-Specific PCR was used to confirm the association of Rcr7 with SNPs in the region. Rcr7 was flanked by two SNP markers and co-segregated with three SNP markers in the segregating population of 465 plants. Seven genes encoding TIR-NBS-LRR disease resistance proteins were identified in the target region, but only two genes, Bo7g108760 and Bo7g109000, were expressed. Resistance to pathotype 5X was also mapped to the same region as Rcr7. B. oleracea lines including ‘Kilaherb’ were tested with five SNP markers for Rcr7 and for resistance to pathotype 3; 11 of 25 lines were resistant, but ‘Kilaherb’ was the only line that carried the SNP alleles associated with Rcr7. The presence of Rcr7 in ‘Kilaherb’ for resistance to both pathotypes 3 and 5X was confirmed through linkage analysis.

Sequence alignment and read mapping. Six bulks were generated; three from the R plants and three from the S plants. RNA from each individual bulk was sequenced, resulted in six RNA sequence files. The total sequence counts were over 44 million (M) for R bulks, and 41 M for the S bulks. Approximately 80% and 83% of the total counts were assembled to reference genome in R and S pools, respectively. The total length of coding sequences assigned to chromosomes in the reference genome was 62.6 Mb. The total accumulated length of coding sequences aligned to chromosomes was ~447 Mb for each of the R and S pools. This gives an estimated 7-fold depth of coverage of the gene coding portion of the reference genome 38 . Transcriptome haplotytping. A haplotype profile was established based on the alignment between the coding sequences of the reference genome and RNA-Seq from the R and S pools. In total, 89 haplotypes were identified, with 48 biallelic and 41 triallelic types (Supplementary Table S1). Biallelic haplotypes were the most frequent within the coding region, with a genome-wide frequency (GWF) of 145,151. In contrast, triallelic types were rare, with a GWF of 170 (Supplementary Table S1). Also, 34% of variants at triallelic sites were non-synonymous and 66% were synonymous. Chromosome C3 contained the highest number of biallelic haplotypes, with chromosome-wide haplotype frequency (CWF) of 23,906. Chromosome C6 had the lowest number, with a CWF of 11,956. Chromosome C7 had the highest number of triallelic haplotypes, with a CWF of 32, while C6 had the lowest number, with a CWF of 7 (Supplementary Table S1).

Analysis of variants and mapping of Rcr7 through analysis of PPV.
Based on alignment with the reference genome, about 156 K variants were detected in either R or S pools, with 154 K SNPs and 2.6 K InDels in each pool (Supplementary Table S2). Also, 36% of variants per pool were nonsynonymous and 64% were synonymous (Supplementary Table S2). Overall, 36.5% of the SNPs and 51.4% of the InDels were polymorphic ( Supplementary Fig. S1).
Chromosome C7 carried the highest PPV (42%) compared to the other eight chromosomes (30-34%) (Fig. 1a). This indicated that Rcr7 was likely located on C7, based on PPV. In addition, a PPV peak (56-73%) was found within the physical interval 41-44 Mb (Fig. 1b), which indicated that Rcr7 likely resided in this region of chromosome C7.  (Fig. 3). The physical distance between these two markers was 230,966 bases. There were 36 genes, including 6 genes that encoded disease resistance proteins and 1 gene that encoded disease resistance-responsive protein in this region, based on Blast2Go search (Supplementary Table S4). Blast search at http://www.arabidopsis.org/wublast/index2.jsp indicated that all seven disease resistance genes encoded TNL proteins. The number of polymorphic variants (SNP and InDel) uniquely identified from the R bulk was further assessed in the seven genes. Five TNL genes (Bo7g108830, Bo7g108840, Bo7g108850, Bo7g108870 and Bo7g109070) in the region did not show any expression and no short reads were assembled into the reference genome, so no polymorphic variants could be identified in the genes ( Table 2). Short reads were identified in two TNL genes, Bo7g108760 and Bo7g109000, in both the R and S bulks ( Table 2). The coding region in Bo7g108760 is 2724 bp in length ( Table 2). One polymorphic InDel was identified ( Table 2, Fig. 3). The coding region in Bo7g109000 is 564 bp in length and no polymorphic SNPs or InDels between the R and S bulks were identified ( Genetic mapping of clubroot resistance to pathotype 3 of P. brassicae in 'Kilaherb'. The two flanking markers and three co-segregating markers (Table 1) were assessed on additional 25 accessions of B. oleracea, including 'Kilaherb' (Table 3). Eleven accessions were resistant to pathotype 3 of P. brassicae, with 0 DSI, but only 'Kilaherb' carried the SNP alleles associated with Rcr7. To determine if the clubroot resistance in 'Kilaherb' was associated with Rcr7, a segregating population consisting of 50 F 1 plants from the cross 'Kilaherb' × T010000DH3 was assessed with the eight SNP markers identified previously. The population segregated in a 1:1 ratio (25R and 25S) to pathotype 3 of P. brassicae, indicating that 'Kilaherb' carried a single dominant clubroot resistance gene. All eight SNP markers associated with Rcr7 in 'Tekila' were also associated with resistance in 'Kilaherb' (Fig. 4). Three SNP markers (SNP_C7_34, SNP_C7_43 and SNP_C7_68) that co-segregated with Rcr7 also co-segregated with resistance in 'Kilaherb' , confirming that 'Kilaherb' carries Rcr7.

Confirmation of
Genetic mapping of resistance to pathotype 5X of P. brassicae in 'Tekila' and 'Kilaherb'. 'Tekila' and 'Kilaherb' were resistant to pathotype 5X of P. brassicae. To test if the resistance to 5X was associated with Rcr7, a subset of the 'Tekila' and 'Kilaherb' segregating populations comprising 88 F 1 plants each were inoculated with pathotype 5X. There were 50 R and 38S plants in the 'Tekila' population and 43R and 55S plants in the 'Kilaherb' population. Segregation for R and S in both populations was consistent with an expected ratio of 1:1 (χ 2 = 1.64 and P = 0.20 in the 'Tekila' population and χ 2 = 1.47 and P = 0.23 in the 'Kilaherb' population). When both populations were test with the eight SNP markers linked to Rcr7, all of the markers co-segregated with resistance to pathotype 5x in 'Tekila' and the genetic map for resistance to pathotype 5X for 'Kilaherb' was exactly the same as that for resistance to pathotype 3 (Fig. 4).

Discussion
Until recently, genetic mapping to identify genetic loci governing traits of interest was time-consuming and laborious. Development of high throughput sequencing (HTS) technology has made sequencing of whole genomes comparatively quick and affordable. One application of HTS is mapping by sequencing (MBS). MBS has been used to map causal genes in the genomes of organisms such as Arabidopsis thaliana 3,39,40 , Caenorhabditis elegans 41 , wheat 35 and maize 36 . A previous study identified a high proportion of PPV on chromosome A03 of B. rapa adjacent to the clubroot resistance gene Rcr1 28 . The current study used NGS and identification of PPV to map a major clubroot resistance gene from B. oleracea, which was designated as Rcr7. A higher PPV was identified within the physical interval 41-44 Mb of chromosome C7 in B. oleracea, which indicated that Rcr7 was likely located in this region. This was confirmed via conventional mapping using linkage analysis. Taken together, these results supported the proposal that identification of PPV could be used for genetic mapping of genes of interest. However, its application and efficacy comparing with other MBS methods still needs to be determined. Although some QTLs for clubroot resistance in B. oleracea have previously been identified and mapped 42 , Rcr7 is the first major clubroot resistance gene finely mapped in the B. oleracea genome. Strong resistance to the main pathotypes in Canada (pathotypes 2, 3, 5, 6 and 8) had previously been identified in the A-genome resistance genes Rcr1, Rcr2 and Rc4, but these genes did not confer resistance to pathotype 5X 11,23,43 . On the other hand, the A-genome resistance genes Rcr8 and Rcr9 conferred resistance to pathotype 5X, but did not confer resistance   to pathotypes 2, 3, 5, 6 and 8 11 . In the current study, resistance to pathotypes 3 and 5X was associated with the Rcr7 region. It is likely that resistance to the pathotypes in 'Tekila' is controlled by the single dominant gene Rcr7 or tightly linked genes on chromosome C7. Although Rcr7 donors 'Takila' and 'Kilaherb' also showed complete resistance to 2, 5, 6 and 8, the mapping populations were not tested against these pathotypes due to the availability of the pathotypes and the limited number of seeds in the mapping populations when we performed the study. Our previous studies on Rcr1, Rcr2 and Rc4 indicated that resistance to pathotypes 2, 3, 5. 6 and 8 was associated 11,23,28 . Further studies are needed to determine if resistance to these pathotypes is associated with Rcr7. When a set of B. oleracea lines were tested for resistance to pathotype 3 and assessed using the flanking and co-segregating markers associated with Rcr7, 'Kilaherb' had the same phenotype and marker profile at all of the marker loci as 'Tekila' . Segregation analysis of the 'Kilaherb' × T010000DH3 F 1 population indicated that resistance was controlled by a major dominant resistance gene, similarly to 'Tekila' . Marker profiling for a set of F 1 segregating lines also supported the hypothesis that 'Kilaherb' carried Rcr7. Both 'Tekila' and 'Kilaherb' were developed by Syngenta, so it is possible that the same source of resistance was used in both hybrids.
A patent (http://www.google.com/patents/EP1525317A1?cl=en) on "Clubroot Resistant Brassica oleracea Plants" shows that Syngenta developed a monogenic dominant resistance to the disease clubroot introgressed from B. rapa. Therefore, it is possible that Rcr7 was originated from A-genome of B. rapa. Research in Canada has focused on identification of clubroot resistance genes in B. rapa 11,22,23,28,44 and the A-genome of B. napus [45][46][47] . Most of the clubroot resistance genes effective against pathotype 3 were mapped into the 24 Mb region of chromosome A03 that contains the cloned clubroot resistance genes CRa/CRb kato , where a cluster of four TNL genes, Bra019413, Bra019412, Bra019410 and Bra019409 are located. However, the genes/alleles do not confer resistance to pathotype 5X 11,23,28 , while resistance to both pathotypes 3 and 5X was associated with Rcr7. A search of the homologous regions of A-genome corresponding to the seven TNL genes in the Rcr7 target region of chromosome C7 revealed that the homologous genes in chromosome A03 were not in the previous mapped A03 genes (Rcr1/Rcr2/Rcr4/CRa/CRb) region. In addition, gene specific SNP markers associated with Rcr1/Rcr2/Rcr4 11,23,28 in the F 1 population of 'Tekila' × T010000DH3 were either monomorphic or not tightly linked to Rcr7 (data not shown). Therefore, Rcr7 is a clubroot resistance gene mapped in the B. oleracea genome, possibly originating from a gene in chromosome A03 of B. rapa, but different from Rcr1/Rcr2/Rcr4 based on the genetic location and resistance specificity. However, the Rcr7 origin and clubroot genes corresponding to Rcr7 in B. rapa require further investigations.  Only 2 of the 7 TNL genes associated with Rcr7 (Bo7g108760 and Bo7g109000) were expressed in this study. No polymorphic variants were identified in Bo7g109000, but one polymorphic InDel was found in Bo7g108760. Therefore, Bo7g108760 is possibly a candidate of Rcr7. However, the profiles of gene expression and DNA variants were not fully captured in this study due to low depth of sequencing, so other candidates may exist. The identity of the TNL gene corresponding to Rcr7 will be addressed after the gene has been cloned.
The transition forms (C → T, G → A) dominated single-base substitutions with the ratio of 1.5 (transition): 1 (tranversion). This ratio was similar to the 2:1 ratio reported from the human genome 48,49 . Despite the bialleleic nature of most SNPs in the current analysis, some loci with three segregating nucleotides may also exist in certain populations 50 . Triallelic SNPs were previously reported in the human genome at a very low frequency compared to biallelic variants. Human transcriptome analysis showed that triallelic sites would likely cause non-synonymous 48 changes. In the current study, 34% of the triallelic SNPs resulted in non-synonymous variation in the B. oleracea reference genome.

Materials and Methods
Schematic flowchart of the experimental procedure is shown in Supplementary Fig. S2.  (Table 3).

Evaluation of plants for resistance to clubroot.
A field population of P. brassicae identified as pathotype 3 22 and a field population L-G02 of pathotype 5X 10 were used for inoculation in this study. Clubroot reaction was assessed in controlled environment studies, as described by Chu, et al. 22 . Briefly, the plants were assessed for clubroot symptoms at 5 weeks after seeding using a 0 to 3 scale, where 0 = no symptoms and 3 = large galls 51 . A clubroot rating of 0 was defined as R, 1-3 as S. A disease severity index (DSI) 51 38 following the pooled sample assembly method described by 28 using SeqMan NextGen in DNASTAR.12 software (Lasergene Inc., Madison, WI). For variant analysis, pooled sample assembly (PSA) was selected because it tends to produce a more complete sequence coverage compared to single-sample assembly 28 . Variants (SNPs and InDels) analysis was performed using SeqMan Pro 12 in DNASTAR.12 software, utilizing the following parameters: SNP% ≥15%, P not ref ≥50%, Q-call ≥15 and depth ≥5.
Gene annotation was analyzed with Blast2GO 53 . Further confirmation of the genes with TNL domains was performed with Arabidopsis thaliana WU-BLAST2 Search at http://www.arabidopsis.org/wublast/index2.jsp. KASP assay and linkage analysis. The sequences flanking the selected SNP sites were used for designing KASP primers (Table 1). DNA oligoes were synthesized at Integrated DNA Technologies Inc. (Coralville, IA, USA). The primers carry standard FAM or HEX tails (FAM tail: 5′GAAGGTGACCAAGTTCATGCT3′; hex TAIL 5′GAAGGTCGGAGTCAACGGATT3′) with the targeted SNP at the 3′ end. KASP assays were performed following the principle and procedure available at http://www.kbioscience.co.uk/reagents/KASP_manual.pdf and http:// www.kbioscience.co.uk/download/KASP.swf. The assays were performed using a StepOne Plus real time qPCR system (Applied Biosystem, Mississauga, ON). Linkage analysis was performed using JoinMap V.4 54 .