Development and validation of an SNP genotyping array and construction of a high-density linkage map in castor

Castor is a commercially important oilseed crop that provides raw materials for several industries. Currently, the availability of genomic resources for castor is very limited. In this study, genome-wide SNPs were discovered in castor via whole-genome sequencing of 14 diverse lines to an average of 34X coverage. A total of 2,179,759 putative SNPs were detected, and a genotyping array was designed with 6,000 high-quality SNPs representing 2,492 scaffolds of the draft castor genome (87.5% genome coverage). The array was validated by genotyping a panel of 314 inbred castor lines, which resulted in 5,025 scorable SNPs with a high call rate (98%) and reproducibility (100%). Using this array, a consensus linkage map consisting of 1,978 SNP loci was constructed with an average inter-marker distance of 0.55 cM. The genome-wide SNP data, the genotyping array and the dense linkage map are valuable genomic tools for promoting high-throughput genomic research and molecular breeding in castor.

www.nature.com/scientificreports www.nature.com/scientificreports/ very simple and inexpensive 14 . Concurrently, several high-throughput SNP array-based genotyping platforms have been developed 15 . Recently, Genotyping by Sequencing (GBS) has become a popular high-throughput marker detection platform 16 . GBS is an NGS protocol employed to simultaneously discover and score segregating markers in populations of interest by sequencing highly multiplexed reduced representation libraries of samples 17 . However, at present, this technique has certain limitations for crop genotyping. A major complication is the high-level DNA sequence duplication in angiosperms. The critical level of sequencing read depth for a given experimental design can be affected by factors such as genome size, genome complexity, ploidy, expected heterozygosity, and others. Poor coverage of specific loci lead to an excessive number of missing values, compromising the accuracy of genetic studies 18 . Moreover, the experimental operation of GBS and data analysis are beyond the capabilities of an average breeding programme; in contrast, 'genotyping arrays' can be used to genotype many samples within a short period, and the data analysis is relatively easy 19 .
In this context, this study was undertaken with the following objectives: (i) to discover genome-wide SNPs in castor through whole-genome sequencing of a set of diverse genotypes, (ii) to develop and validate an SNP genotyping array, and (iii) to construct a dense linkage map for high-throughput genomic applications in castor.

Results
Whole-genome sequencing and discovery of sNps. A set of 14 diverse castor genotypes, including cultivars and germplasm accessions, was chosen for whole-genome sequencing ( Table 1). Paired-end (PE) sequencing libraries were generated using the genomic DNA of all samples. The libraries were of high quality per the sizing profiles in Agilent HS Chip. The fragment size of the libraries ranged from 402 bp to 875 bp. Sequencing of these libraries using Illumina's MiSeq (3 genotypes) and HiSeq (11 genotypes) platforms yielded 5.4 to 22.0 Gb of equivalent data per sample with 15X to 62X coverage. In total, 169 Gb of DNA sequence (755 million 2 × 100 bp reads and 36 million 2 × 250 bp reads) data were generated for 14 castor genotypes. The number of reads per sample ranged from 11.3 to 108.9 million ( Table 2). The huge difference in the number of reads among samples was due to the use of two different sequencing platforms (MiSeq and HiSeq by Illumina). The MiSeq platform yielded an average of 12.2 million reads (6.2 Gb data), whereas HiSeq yielded 68.7 million reads (13.7 Gb data).
The high-quality filtered reads were aligned with the reference genome. The mapping percentage ranged from 87.4 to 98.1. On an average, 95.8 per cent of the reads could be mapped onto the reference genome. A good level of genome coverage was observed. The per cent genome coverage (without Ns) ranged from 92.8 to 97.4, with an average of 95.4. The number of SNPs/InDels identified between sample and reference ranged from 294,956 to 929,114 (Table 3). A total of 2,179,759 SNPs were identified across the samples, with a frequency of one SNP per 160 bp.
Genotyping array design. The putative SNPs identified through sequencing of 14 castor genotypes were further filtered for inclusion in genotyping assay development. On on average, 15 per cent of SNPs remained after filtering out the (i) SNPs with a less than 75 bp flanking region, (ii) SNPs located within 100 bp from the target SNP, (iii) insertion or deletion variants (InDels), and (iv) heterozygous SNPs. After filtration, the total number of common SNPs across 14 samples was 130,666, of which 21,185 were from genic regions. From this set, 10,872 SNPs were initially selected for array design, considering their functional roles (fatty acid biosynthesis, oil metabolism, disease resistance, and ricin biosynthesis) and genomic coverage (maximum possible representations of scaffolds of the reference genome). Out of 10,872 SNPs, we finally selected 6,000 SNPs, that required only one bead type for genotyping and had an assay design score of >0.7 for assay development. The details of the SNPs selected for array design are given in Supplementary Table S1. The set of SNPs in the array represented 2,492 scaffolds of draft castor genome, with genome coverage of 87.5 per cent. Out of 6,000 SNPs in the final array, 3,619 SNPs were from genic regions.
Validation of genotyping array. The technical performance and the practical utility of the SNP array were assessed by genotyping a panel of 314 inbred castor lines. Of the 6,000 targeted SNPs, only 5,238 could be included on the Infinium BeadChips. Out of the 5,238 SNPs present on the Infinium BeadChips, high-quality genotype calls were obtained for 5,038 SNPs, while clear genotype clustering was not obtained for the remaining 200 SNPs.
Out of 5,038 scorable SNPs, 927 were monomorphic (19%) and 13 SNPs had missing data points of more than 20 per cent in the panel of 314 lines. Finally, 4,098 SNPs were called successfully with high confidence. The GeneTrain and GeneCallC50 scores of all the SNPs were >0.4. The call rate of SNPs ranged from 80 to 100 per cent, with an average of 98 per cent (Fig. 1). The reproducibility of all 4,098 SNPs across technical and biological replicates was 100 and 99.62 per cent, respectively. From these parameters, it is very evident that the genotyping quality and reliability of this array are very high.
The biological utility of the SNP array was assessed by analyzing the genetic diversity and relationships in the genotype panel. Of the 4,098 informative SNPs, 2,690 (65.64%) had an MAF of >0.2 and could be considered as markers with normal allele frequencies. Approximately 18 per cent of SNPs had an MAF of 0.1 to 0.2. A total of 291 SNPs had an MAF of <0.05. In addition, 489 SNPs (12%) showed almost equal allele frequencies (with MAF close to 0.5) for two alternative alleles. The distribution of MAF is shown in Fig. 2.
The polymorphism information content (PIC) values for individual SNPs ranged from 0.003 to 0.375, with the peak distribution at >0.3 (Fig. 3). The average observed heterozygosity across the castor lines was 0.11, and the expected heterozygosity (gene diversity) ranged from 0.003 to 0.500, with an average of 0.351. A neighbour-joining tree was constructed using the genotypic data from 4,098 SNP loci in 314 castor lines. The genetic relationship revealed by the dendrogram was as expected based on the pedigree and breeding history. For instance, a pistillate line, DPC9, and its nine mutant selections were placed together. Similarly, the inbred lines with shared pedigrees were in close proximity in the dendrogram ( Supplementary Fig. 1).
www.nature.com/scientificreports www.nature.com/scientificreports/ Construction of linkage map. Linkage map of JC12×48-1 cross. The RIL population of the JC12 × 48-1 cross, along with their parents, were genotyped using the SNP array. A total of 1,090 SNP loci were polymorphic and segregated in the population, of which 246 loci (22%) showed segregation distortion towards either of their parents. Out of 1,090 SNPs, 1,028 showed unique segregation. Linkage analysis was performed using the genotypic data from all 1,090 SNP markers. All markers could be mapped, and a genetic linkage map was constructed with 10 linkage groups, corresponding to the number of haploid chromosomes. The total length of the map was 1,139.8 cM. The number of markers per linkage group ranged from 82 to 207 (Table 4). The map was very dense, with an average inter-marker distance of 1.12 cM. The longest gap was 12.4 cM. Only seven marker intervals were longer than 10 cM. The map is provided in Supplementary Table S2.
Linkage map of DCS9×RG1139 cross. The RILs of the DCS9 × RG1139 cross, along with their parents, were genotyped using the SNP array. A total of 1,273 SNP loci were scored as polymorphic in the population. About 28 per cent of the loci showed skewed segregation. Linkage analysis was performed using the genotypic data from all SNP markers, and a genetic linkage map was constructed with 10 linkage groups. The total length of the map was 904.8 cM. The number of markers per linkage group ranged from 45 to 183 (Table 4). The map was very dense, with average inter-marker distance of 0.81 cM. The longest gap was 8.0 cM. The map is provided in Supplementary  Table S3.
Consensus map. There were 392 markers in common between the two crosses, which enabled the construction of a consensus map. The number of common markers per linkage group ranged from 9 (LG-10) to 75 (LG-1). The consensus map was composed of 1,978 SNP markers and spanned a total length of 995.8 cM with an average distance of 0.55 cM between adjacent markers. The marker orders in the consensus map were in general consistent with the individual genetic linkage maps (Fig. 4). The map is provided in Supplementary Table S4.
The pairwise comparison of the linkage maps indicated no potential chromosomal rearrangements, and none of the markers were found to map to different LGs.

Discussion
Even though castor is an industrially important crop, it has not yet benefited from the genomic revolution. The availability of genomic resources for castor is very limited. This study attempted to generate genomic resources, viz., genome-wide SNP loci, a genotyping array and a linkage map.
To identify genome-wide SNPs, whole-genome sequence data for 14 inbred castor lines were generated with very high coverage (average 34X) and mapped to the reference genome with genome coverage (without Ns) of 92.8 to 97.4 per cent. The high mapping percentage achieved in this study might be due to the use of paired-end sequencing, which helps to precisely map the short reads onto the reference. In sorghum, Nelson and his collegues 20 could map only 30 per cent of the reads when they used single-end reads from restriction site associated DNA sequences. By aligning the sample reads with the reference, a total of 2,179,759 SNPs were www.nature.com/scientificreports www.nature.com/scientificreports/ identified across samples, with a frequency of one SNP per 160 bp. The SNP frequency observed in castor appears to be higher than that observed in self-pollinated crops such as pea (1 SNP/540 bp) 21 , soybean (1 SNP/490 bp) 22 , rice (1 SNP/238 bp) 23 , and wheat (1 SNP/233 bp) 23 but lower than in cross-pollinated perennial species such as Eucalyptus (1 SNP/16 bp) 24 and oil palm (1 SNP/74 bp) 25 . However, it is important to note that the frequencies of SNP occurrence reported in the above-mentioned studies might be over/under estimated because those studies were based on only a few genes or transcriptomes, not whole genomes. The frequency of SNP occurrence varies widely among crops depending upon their domestication history, their reproductive habits (autogamous or allogamous), the diversity of the populations under assessment, and the status (coding or non-coding) of the analysed regions 21 . From the large catalogue of SNPs, a genotyping array with 6,000 carefully chosen SNPs was designed. The set of SNPs in the array represented 2,492 scaffolds of the draft castor genome with genome coverage of 87.5 per cent. Out of 6,000 SNPs in the final array, 3,619 were from genic regions. SNPs in genic regions could potentially be more informative in genome-wide association studies.
To assess the utility of the SNP array, a set of 314 castor lines representing the whole spectrum of the diversity available in the species was genotyped using the SNP array. The genotype panel comprised accessions of core germplasm, trait-specific germplasm lines, elite breeding lines, commercial varieties, and parents of commercial hybrids. Of the 6,000 targeted SNPs, only 5,238 could be included on the Infinium BeadChips. The manufacturing loss (12.7%) was within the expected limits and comparable to that in other studies [26][27][28][29] . Out of 5,238 SNPs present on the Infinium BeadChips, high-quality genotype calls were obtained for 5,038 SNPs. The success rate observed in this study (96.2%) was higher than the typical ≥90% rate obtained earlier with other crops. The high success rate might be due to the high sequencing depth and stringent filtration criteria applied while selecting SNPs for assay development. We carefully removed additional polymorphisms in the flanking sequences of target SNPs. The lower success rates observed in other studies could be due to their lack of specific selection for conserved SNP flanking sequences during genotype assay development. For example, the success rate in conifers was  www.nature.com/scientificreports www.nature.com/scientificreports/ 69 to 77 per cent 30 , and in Pinus pinaster was 61 to 73 per cent 31 , when no specific selection for more conserved SNP flanking sequences was carried out.
Around 19 per cent of the SNPs were scored as monomorphic. Such a high level of false positives may have resulted from the use of draft assembly for SNP prediction. It is expected that reads from regions missing in a draft reference may map incorrectly to other regions of the assembly and lead to false SNP predictions. These would then show as monomorphic in the Illumina assay. Nevertheless, the technical performance of the array was superior, as evidenced by the high average call rate (98%) and reproducibility (100%).
The diversity parameters derived from the genotypic data for 314 inbred castor lines representing the whole spectrum of the diversity available in the species indicated only a moderate level of genetic diversity in castor. The average observed heterozygosity was 0.11, which is within the expected range of residual heterozygosity reported earlier 13,32 . The expected heterozygosity (gene diversity) ranged from 0.003 to 0.500, with an average of 0.351. The low level of heterozygosity could be due to preferential selection of SNPs in genic regions, which usually show reduced diversity. Previous studies have also reported low to moderate gene diversity estimates irrespective of the marker system. Allan and his collegues 11 surveyed 41 germplasm accessions from five continents and 35 countries using AFLPs (3 primer combinations) and SSRs (9 primer pairs) and found that the average gene diversity was 0.126 for AFLPs and 0.239 for SSRs. Qiu and his collegues 32 reported a relatively higher level of gene diversity   www.nature.com/scientificreports www.nature.com/scientificreports/ (He = 0.41) by genotyping 24 accessions representing the main germplasms of castor from 14 countries using 118 polymorphic EST-SSR markers. Comparable levels of gene diversity (He = 0.38) were observed in our earlier study, in which 144 inbred lines derived from the core germplasm collection of castor were genotyped using 45 SSR markers 12 . The gene diversity estimate is generally influenced by the number of marker loci and the number/ representation of genotypes used in the study. As the present study involved a large number of genotypes (more than 300 lines comprising core germplasm, breeding lines, and cultivars) and markers (over 4,000 SNP loci), we can confidently infer that only a moderate level of molecular diversity prevails in castor, despite its being a predominantly outcrossing species. The low level of diversity in castor might be due to its monotypic species status.
The genetic linkage map is a critical tool for molecular genetic studies and plant breeding applications. It is an important tool for physical mapping of genomes. High-density linkage maps have direct applications in marker-assisted selection through tight linkage of markers with the gene of interest. High-density genetic linkage maps are also useful in orienting and anchoring scaffolds arising from the genome sequence data 33,34 . To date, only two linkage maps largely based on SSR markers have been constructed for castor 4,7 . In this study, two RIL populations were used to construct a linkage map. The SNP array was used to genotype the population. The total number of polymorphic markers was 1,090 in the JC12 × 48-1 cross and 1,978 in the DCS9 × RG1139 cross. The proportion of skewed markers was more or less similar in both populations (22% in JC12 × 48-1 and 28% in DCS9 × RG1139); however, only 40 markers were shared across populations, indicating that the segregation distortion is specific neither to the cross nor the genomic region. The maps are fairly dense. The average inter-marker distance is 1.12 cM in the JC12 × 48-1 map and 0.81 cM in the DCS9 × RG1139 map.
It is interesting to note that the maps differ in total length, even though both were derived from RIL populations. The DCS9 × RG1139 map (904.8 cM) was smaller than the JC12 × 48-1 map (1,139.8 cM), which may be attributed to reduced recombination in DCS9 × RG1139 35 . In general, the maps of inter-specific crosses are significantly shorter than those of intra-specific crosses 36 . This was not the case in the present study, although RG1139 is a wild collection, whereas the other parents are improved breeding lines. Even though the individual populations differed in the extent of recombination, the marker order between individual maps was largely conserved (Fig. 4).
A consensus map was constructed using 392 markers that were shared between the two maps. The consensus map covered 995.8 cM and included 1,978 SNP markers, with only one gap of >10 cM and nine other gaps of >5 cM. The length of the consensus map was in between those of the individual genetic maps. A total of 722 scaffolds of the draft castor genome, covering 185.13 MB were anchored onto the consensus map. These anchored genome sequences could be used as a source for the development of additional markers 37 and provide a valuable tool for fine-mapping and map-based cloning.
In summary, the SNP array developed in this study provides a valuable tool for high-throughput and cost-effective genotyping and mapping applications in castor. The genotyping array will provide the required resolution for discovery of marker-trait association through linkage and association analysis, evaluation of genetic variations, unravelling the genetic architecture of key quantitative traits, and exercising genomic selection in castor. The high-density linkage map can act as a reference for genetic and molecular studies in castor. The map can be readily used for (i) improving the genomic assembly of castor, (ii) QTL identification for several important traits, (iii) reference in constructing framework maps for individual populations, (iv) comparative genetic and genomic studies with related crops like cassava, and (v) identification of novel genes/alleles from related species.

Materials and Methods
Materials. Candidate genotypes for sequencing. A set of 14 inbred castor lines was chosen for whole genome-sequencing based on their genetic distinctness and utility in genetic studies and breeding programmes. The genotype panel represented wild (germplasm accessions) and cultivated (commercial varieties, parental lines of commercial hybrids, and parents of mapping populations) genepools. The details of the selected lines are given in Table 1.  www.nature.com/scientificreports www.nature.com/scientificreports/ Genotype panel for validation of sNp array. The SNP array designed in this study was validated by genotyping a set of 314 inbred castor lines. The composition of the genotype panel included (i) a castor core germplasm set derived from a global collection of over 3,000 accessions, (ii) known sources for important agronomic traits, (iii) elite inbred lines developed from major castor breeding centres in India, and (iv) varieties and parental lines of hybrids released for cultivation; thus, the panel fairly represented the diversity available in castor.
Mapping populations. Two recombinant inbred line (RIL) populations derived from the crosses JC12 × 48-1 and DCS9 × RG1139 were used to construct of linkage maps. The parents differed on range agronomic traits, including resistance to gray mold disease caused by Botryotinia ricini, vascular wilt caused by www.nature.com/scientificreports www.nature.com/scientificreports/ Fusarium oxysporum f.sp. ricini, and reniform nematode (Rotylenchulus reniformis). One true F 1 plant of both the crosses was advanced to F 2 by selfing. The individual F 2 plants were further advanced up to F 6 generation through the single-seed descent method. Finally, a total of 185 F 6 -RILs of JC12 × 48-1 and 188 F 6 -RILs of DCS9 × RG1139 were used for linkage map construction.

Methods
Whole-genome sequencing. Genomic DNA from 14 candidate genotypes was isolated using the CTAB method. The quality of the DNA was checked through 0.8% agarose gel electrophoresis and quantified using a Qubit ® Fluorometer. Paired-end (PE) whole-genome sequencing (WGS) libraries were generated using the sample preparation kit from Illumina (Illumina Inc., USA) per the manufacturer's protocol. Fragmentation of genomic DNA (1 µg) was carried out using the Covaris S2 System (Covaris, Inc., USA) followed by adapter ligation. Enrichment of adapter-modified DNA fragments by PCR was performed, and the enriched libraries were analyzed for size and quality using a high-sensitivity DNA kit on the Bioanalyzer (Agilent Technologies, USA). A paired-end cluster generation kit (Illumina Inc., USA) was used for cluster generation on paired-end flow cells.
Cluster generation was carried out by hybridization of WGS libraries with a concentration ranging from 9 to 55 nmol/L, which was further diluted to make a stock solution of 4 nmol of molecules on the oligonucleotide-coated surface of the flow cell. Isothermal amplification of the libraries was carried out to generate clonal DNA clusters. Eleven libraries were multiplexed for sequencing on five different lanes of HiSeq2000 (Illumina Inc., USA) with an average read length of 2 × 100 bp. Three libraries were individually sequenced on MiSeq (Illumina Inc., USA) with an average read length of 2 × 250 bp.
Discovery of genome-wide sNps. The raw reads were filtered using Trimmomatic v0.17 38  An available draft genome sequence of Ricinus communis L. with a 350 Mb genome size 3 was downloaded from http://castorbean.jcvi.org/downloads.php. The filtered sample reads were aligned with the reference genome using BWA software 39 with default parameters. SNP discovery and filtration were carried out using SAMtools v0.1.19 (mpileup and varFilter) 40 with default parameters. The SNPs were filtered based on the following parameters: (i) minimum RMS mapping quality for SNPs: 25; (ii) minimum read depth: 8; (iii) window size for occurrences of one SNP: 100 bp; and (iv) window size for filtering dense SNPs: 1.
Design of genotyping array. A genotyping array was developed on the Illumina Infinium platform (Illumina Inc., USA) with 6,000 carefully chosen SNPs. From the set of SNPs that passed the technical quality filters, as described in section 2.3, further filtration was carried out by removing (i) the SNPs with minor allele frequency of <0.2, (ii) SNPs located on the terminal region of the scaffolds and with less a than 75 bp flanking region, (iii) SNPs located within a distance of 100 bp to the target SNP, (iv) the insertion or deletion variant (InDel) and heterozygous SNPs, and (v) SNPs in the repetitive region. Among the SNPs amenable to Illumina Infinium assay development, SNPs were further selected based on functional roles (fatty acid biosynthesis, oil metabolism, disease resistance, and ricin biosynthesis) and genomic coverage (maximum possible representation of scaffolds with respect to the reference genome). The custom Infinium genotyping BeadChips assay was designed using Illumina Assay Design tools (ADT) for the selected SNPs. Finally, 6,000 SNPs that required only one bead type for genotyping and had assay design scores of >0.7 were selected for assay development.
Validation of genotyping array. The technical performance of the array was evaluated by genotyping a panel of 314 inbred castor lines. The genomic DNA was extracted from 100 mg of leaf samples collected from plants of each of the inbred lines using the CTAB method. The DNA samples were quantified using Quibit and a NanoDrop-8000 UV-Vis Spectrophotometer (Thermo Fisher Scientific, USA). The samples' quality was checked through agarose gel electrophoresis. Illumina Infinium genotyping assay was performed per standard procedures, which are briefly described below.
Genomic DNA samples from 314 lines in addition to two biological and two technical replicates, were amplified overnight. The amplified products were subsequently fragmented by a controlled enzymatic process. The DNA fragments were alcohol-precipitated and re-suspended. The BeadChips were prepared for hybridization in the capillary flow-through chamber. Samples were applied onto the BeadChips and incubated overnight. The DNA fragments of samples annealed to locus-specific oligomers during the hybridization step, wherein one bead type corresponded to a specific allele of the SNP locus. After hybridization, allelic specificity was conferred by enzymatic base extension. The products were subsequently fluorescently stained. The intensities of the beads' fluorescence were detected by the iScan system (Illumina Inc., USA), and genotype calling was performed using GenomeStudio software (Illumina Inc., USA) with a GenCall score cut-off of 0.15 (as recommended by Illumina for Infinium data).
The genotyping array;s quality was assessed based on the call rate, reproducibility, and polymorphism detection ability of the SNPs in the array. Various genetic diversity parameters, namely, minor allele frequency (MAF), observed heterozygosity (Ho), gene diversity or expected heterozygosity (He), and polymorphic information content (PIC), were estimated using the software program PowerMarker version 3.25 41 .
Construction of linkage map. The RILs of JC12 × 48-1 and DCS9 × RG1139, along with their parents, were genotyped using the SNP array developed in this study. Genotyping was done as described in section 2.5.