Genome resequencing and comparative variome analysis in a Brassica rapa and Brassica oleracea collection

The closely related species Brassica rapa and B. oleracea encompass a wide range of vegetable, fodder and oil crops. The release of their reference genomes has facilitated resequencing collections of B. rapa and B. oleracea aiming to build their variome datasets. These data can be used to investigate the evolutionary relationships between and within the different species and the domestication of the crops, hereafter named morphotypes. These data can also be used in genetic studies aiming at the identification of genes that influence agronomic traits. We selected and resequenced 199 B. rapa and 119 B. oleracea accessions representing 12 and nine morphotypes, respectively. Based on these resequencing data, we obtained 2,249,473 and 3,852,169 high quality SNPs (single-nucleotide polymorphisms), as well as 303,617 and 417,004 InDels for the B. rapa and B. oleracea populations, respectively. The variome datasets of B. rapa and B. oleracea represent valuable resources to researchers working on evolution, domestication or breeding of Brassica vegetable crops.


Background & Summary
Many important crops have been domesticated and are cultivated from the genus Brassica, including those used as oilseeds, condiments, fodder, or vegetables. The 'triangle of U' 1 describes the relationships between the six economically important Brassica species. Three of the six species are diploids (A genome, Brassica rapa, n = 10; B genome, B. nigra, n = 8; and C genome, B. oleracea, n = 9), while the other three species are allotetraploids resulting from interspecific hybridization between the diploids (AB genomes, B. juncea, n = 18; AC genomes, B. napus, n = 19; BC genomes, B. carinata, n = 17). These different crops are characterized by their specialized development of organs 2 , and are referred to as morphotypes.
Interestingly, similar morphotypes are often selected for in two or more Brassica species, clearly illustrating convergent crop domestication. This includes the leafy heads in Chinese cabbage (B. rapa) and cabbage (B. oleracea) and tubers at stems or hypocotyls/roots in kohlrabi's (B. oleracea), swede's (B. napus) and turnips (B. rapa), enlarged stems in marrow stem kale (B. oleracea) and also broccoletto (B. rapa), enlarged inflorescences in cauliflower and broccoli's (B. oleracea), the many axillary shoots in Brussels sprouts (B. oleracea) and mizuna/mibuna's (B. rapa) and the increased numbers of enlarged seedpods in oilseed (B. rapa and B. napus).
In recent years, the genomes of B. rapa and B. oleracea have been assembled and released [3][4][5] . Brassicaceae comparative genomics provided 24 genomic blocks (GBs) as the framework for genome studies in Brassicas 6 . With the GBs system, the comparative genomic analysis of B. rapa, B. oleracea, and A. thaliana revealed the whole genome triplication (WGT) event that is shared by all Brassicas 7-9 , and made it possible to deduce and reconstruct the diploid ancestor for the Brassiceae tribe 10 . Further studies reported the phenomenon of sub-genome dominance among the three sub-genomes in Brassicas 8,9 , and found that the biased distribution of small RNA-targeted TEs plays an important role in this phenomenon 11,12 . Furthermore, the release of the genome sequences has also contributed to the mapping, cloning and functional studies of agronomic genes in Brassica crops, such as two clubroot resistance genes 13,14 , two blackleg resistance genes 15,16 , as well as two key genes regulating the time of flowering [17][18][19] . The availability of these reference genomes also made it possible to build variome datasets by resequencing populations of B. rapa and B. oleracea, to investigate domestication of the diverse morphotypes.
Here, we selected 199 B. rapa and 119 B. oleracea accessions representing the 12 and nine morphotypes from these two species, respectively (Fig. 1, Supplementary Tables 1 and 2). These accessions originate from a wide geographic range, from North till South Asia and Europe 20 . Whole genome resequencing data of these 199 B. rapa and 119 B.oleracea accessions was generated using the Illumina HiSeq2000 platform, with paired-end reads from 350 bp insert libraries. After filtering lowquality or duplicated reads, totally, we produced 982.87 Gb resequencing data for the 199 B. rapa accessions and 742.35 Gb for the 119 B. oleracea accessions, which is about on average 10 × coverage for accessions of both B. rapa and B. oleracea, ranging from 1.67 × till 19.8 × coverage. These reads were then mapped to the reference genomes of B. rapa version V1.5 and B. oleracae V1.0, respectively. Finally, we called out 2,249,473 and 3,852,169 reliable SNPs (single-nucleotide polymorphisms), as well as 303,617 and 417,004 InDels for the B. rapa and B. oleracea populations, respectively. With these datasets, we have identified genomic selection signals for traits of leaf-heading and tuberous organs in both B. rapa and B. oleracea, and found that sub-genome parallel selection was associated with morphotype diversification and convergent crop domestication in the two Brassica species 21 .
The whole genome resequencing data and variome datasets for the B. rapa and B. oleracea populations are valuable resources to researchers studying evolution, domestication or trait discovery, and to breeders of Brassica crops. These data can be exploited to develop genetic markers, and chip arrays for gene mapping and functional studies.

Methods
These methods are expanded versions of descriptions in our related work 21 .
Sample collection of B. rapa and B. oleracea  Table 1 Table 2).
We have grown all the 199 B. rapa accessions with three replicates in the green house during autumn 2014 to confirm their morphotypes. The phenotypes of these accessions were investigated till their maturity.

Sample preparation and resequencing
Plants of the 199 B. rapa accessions were grown in a greenhouse, each accession with five replicates. At the six-leaf stage, the two youngest leaves were collected from one of the five plants and DNA was extracted from these leaves. For B. oleracea, the DNA of the 44 germplasm lines and six genebank accessions was extracted as described for B. rapa. However for the DH and inbred lines, 50-100 seedlings per genotype were grown on moist filter paper. Cotyledons and hypocotyls were harvested after 12 days and DNA was then extracted. DNA libraries with approximately 350 bp insert sizes were constructed following the manufacturer's instructions (Illumina GAII) and paired-end resequencing reads were generated by commercial Illumina HiSeq 2000 service provided by Biomarker Technologies Corporation, Beijing, and BGI, Shenzhen.
Raw data were filtered before alignment We filtered the raw reads before alignment to the corresponding reference genomes of B. rapa 3 or B. oleracea 5 . First, low quality reads were filtered based on the following three rules. If one end of a paired-end read had >5% 'N' bases, then the paired-end read was removed. Secondly, for each pairedend read pair, if one of two reads had an average base-quality less than 20 (Phred-like score), then both reads were removed. Thirdly, for each read, we trimmed its 3' bases if the quality scores are below 13. The trimming was stopped at the base with quality score ≥13. After trimming, if the remaining read was less than 40, the paired-end reads were removed. Moreover, considering that the duplicates of paired-end reads were generated from a single amplicon, we further removed redundant duplicated reads and only kept a single pair. We completed this raw read filtering process using an in-house made Perl script.

Alignment and variants calling
Filtered reads of each re-sequenced sample were mapped onto the corresponding reference genomes using the software Burrows-Wheeler Aligner (BWA version 0.7.5a-r405) 22 . We used the B. rapa genome version 1.5 and B. oleracae V1.0 as the references for the two species 4 . The reference genomes were first indexed by using commands as 'bwa index reference.fa' and 'samtools faidx reference.fa'. The clean reads of each accession were then mapped to the indexed reference genomes one by one with the algorithm 'mem' of BWA.  23 was first used to transfer the sam file to bam file by command 'samtools view -bS sample.sam sample.bam'. The bam file was then sorted by 'samtools sort sample.bam sample.sorted.bam'. After index the sorted bam file with 'samtools index sample.sorted.bam', candidate genomic variants were called out by using the algorithm 'mpileup' of Samtools. The full command was 'samtools mpileup -q 20 -Q 30 -ugf reference.index sample.sorted.bam | bcftools view -p 0.9 -cg ->candidateVariants.list'. We set '-q 20' to use nucleotides with Phred-like quality scores higher than 20 as reliable nucleotides of a read to report variants. '-Q 30' denotes that mapping quality of reads should be higher than 30 to be considered as a reliable mapping. Bcftools was used to transfer the vcf file generated by 'mpileup' and reported candidate variants in an output file 23 . We used '-p 0.9' to ask Bcftools to report variants at a locus with more than 10% reads showing a different genotype to that of the reference.
We further filtered the candidate variant list for reliable variants. For each accession, we screened its variations one by one. Here, we called the genotype that is the same as the reference 'reference allele', while the one that is different to the reference as 'alternative allele'. For each variant, only alleles that were covered by sufficient reads (≥3 reads) were considered as confident alleles, and the variant was kept as a reliable variant. We performed this process for all Brassica accessions. This removed potentially false variants produced by sequencing errors. Since most of the Brassica samples (except six genebank accessions of B. oleracea) are from homozygous or almost homozygous accessions, we filtered our data further by retaining the variant calls that were homozygous in individual homozygous accessions (loci were considered as heterozygous if 0.2>D R /(D A +D R )>0.8, where D R denotes the number of reads with the reference allele, and D A denotes the number of reads with the alternative allele). We developed in-house Perl scripts to complete the candidate variants filtering processes. In order to remove rare SNPs and use common SNPs to scan for selection signals, we filtered out SNPs that had a MAF (minor allele frequency) o0.05 in both B. rapa and B. oleracea. Totally, we obtained 2,249,473 SNPs and 303,617 InDels for the B. rapa population, as well as 3,852,169 SNPs and 417,004 InDels for the B. oleracea population. For the two accessions with lowest re-sequencing depth in B. rapa and B. oleracea, we still obtained genotypes of 1,577,624 and 2,690,136 SNPs, respectively. We plotted genomic heterozygosity, density of SNPs and InDels along with the gene density to show the distribution features of these variants in the two Brassica genomes (Fig. 2).

Functional annotation of genomic variants
The variants identified in the two Brassica genomes were further annotated into different groups (Table 3 and Table 4). According to the genomic positions of SNPs and InDels relevant to predicted gene models, we first separated them into 1) variants located at genic regions and 2) at inter-genic regions. Variants located at genic regions were further separated into three subgroups: a) variants in coding sequences (CDs), b) in introns and c) in untranslated regions (UTRs). The SNPs and InDels located at CDs (1a) were classified into two subgroups: the subgroup causing changes to the coding amino acids, including non-synonymous SNPs and frame shift InDels, and the subgroup causing no changes to the amino acids, including synonymous SNPs and InDels that do not cause frame-shifts. Intronic variants (1b) were also separated into two subgroups: causing (8 bp to the splice site) or not causing splice site mutations; UTR variants (1c) are divided into two subgroups: variants in 5 0 or 3′UTR regions (Fig. 3a,b). The results show that among these 2,249,473 SNPs and 303,617 InDels in B. rapa, 161,319 SNPs (nonsynonymous and splice site) and 16,905 InDels (CDS and splice site) respectively, introduced changes to the protein sequences (Table 3); for the 3,852,169 SNPs and 417,004 InDels in B. oleracea, 154,863 SNPs and 16,687 InDels respectively introduced changes to the protein sequences (Table 4). Additionally, we analyzed the length distribution of InDels, and found that 1 bp deletions or insertions are the dominant InDels (Fig. 3c) in both Brassica populations. The length distribution of InDels located in coding regions of genes was also investigated, and it was found that InDels of three or fold changes of three nucleotides are clearly the dominant types (Fig. 3d). These InDels will not introduce frame-shift mutations to genes, thus are under less stringent selection compared to InDels that don't correspond to fold changes of three.
We further investigated the genetic diversity within each morphotype group in both species with the annotated genomic variations. We performed the analysis by counting the number of polymorphic variants in each morphotype group. The results showed that groups of turnip (54) and Chinese cabbages (46) in B. rapa had most polymorphic loci based on both total SNPs as well as functional SNPs (non-synonymous SNPs or SNPs located at splicing sites) (Supplementary Table 3). Interestingly, the groups of turnip and pak choi had the most polymorphic loci based on total InDels and functional InDels      Table 4). The number of polymorphic variants reflects the genetic diversity in each group. However, the number of variants is also impacted by the numbers of samples studied.

Code availability
Custom perl scripts used to filter raw reads and candidate genomic variants can be downloaded through http://brassicadb.org/brad/datasets/pub/ReseqPars/codes/ in Brassica database BRAD, or are freely    available when requested by email to the authors. Other tools used in this work including version details are BWA (version 0.7.5a-r405) and SAMtools (version 0.1.19-44428 cd). Command lines with details of parameters when using these tools are described in the method section.

Data Records
All the raw reads of 199 B. rapa and 119 B. oleracea accessions used in this work have been deposited in the NCBI Sequence Read Archive with the accession ID SRP071086 (Data Citation 1). The SNP and InDel datasets identified on the resequencing data and related vcf files are public available in Brassica Database BRAD (Data Citation 2).

Technical Validation
In order to measure the quality of variants called out from the resequencing data, we selected five SNPs (Table 5), and genotyped 95 out of the 199 B. rapa accessions for these SNPs using the method of KASP (kompetitive allele specific PCR). With the same method, we also test the polymorphic level of five SNPs determined from 119 B. oleracea accessions in another group of 281 B. oleracea accessions. KASP is a homogeneous, fluorescence-based endpoint SNP genotyping platform (LGC Genomics LLC, Beverly, MA, USA). The five selected SNPs satisfied the following criteria of KASP experiments: 1) No other genomic matches were found for the 50 bp sequences flanking the two sides of each candidate SNP; 2) There are no other SNPs or InDels located at the 50 bp flanking regions of the candidate SNP. For each SNP, two allele-specific primers A1 and A2 and one common reverse primer C were designed by LGC Genomics. The PCR reaction mixture had a total volume of 5 μl and included 2.5 μl DNA, 2.5 μl 2 × master mix with 0.07 μl primer mix according to the manufacturer's guidelines. Three no-template controls were included for each SNP locus. The amplification process was ran in Gene Amp PCR System 9700 (Applied Biosystems) using the following program: 94°C for 15 min followed by 10 touchdown cycles of 94°C for 20 s, 61-55°C for 60 s (decreasing by 0.6°C per cycle), followed by 26 cycles of 94°C for 20 s, 55°C for 60 s. Fluorescence detection was then performed in a 7900 HT Fast Real-Time PCR System (Applied Biosystems), and the results were analysed using SDS2.3 Software (supplied by Applied Biosystems).
We further compared the genotypes of the five SNPs in the 95 samples that were called out by resequencing data, with the genotypes of these loci that were reported by the KASP experiments. Results show that all the 475 loci analyzed have consistent genotypes between the two methods of resequencing and the KASP experiment (Table 5), supporting the fact that the genomic variants are of high quality. The polymorphism level of the five SNPs in B. oleracea is shown in Supplementary Table 5.

Usage Notes
All the resequencing datasets of this work are deposited in the NCBI database with the accession number SRP066057 (alias: PRJNA301642). Access to the plant materials (or DNA samples for materials provided from companies) described in this paper is possible for research purposes via a material transfer agreement with the material owners. Please contact Dr Xiaowu Wang (wangxiaowu@caas.cn) or Guusje Bonnema (guusje.bonnema@wur.nl).