A 62K genic-SNP chip array for genetic studies and breeding applications in pigeonpea (Cajanus cajan L. Millsp.)

Pigeonpea is the second most important pulse legume crop for food and nutritional security of South Asia that requires accelerated breeding using high throughput genomic tools. Single nucleotide polymorphisms (SNPs) are highly suitable markers for this purpose because of their bi-allelic nature, reproducibility and high abundance in the genome. Here we report on development and use of a pigeonpea 62 K SNP chip array ‘CcSNPnks’ for Affymetrix GeneTitan® platform. The array was designed after filtering 645,662 genic-SNPs identified by re-sequencing of 45 diverse genotypes and has 62,053 SNPs from 9629 genes belonging to five different categories, including 4314 single-copy genes unique to pigeonpea, 4328 single-copy genes conserved between soybean and pigeonpea, 156 homologs of agronomically important cloned genes, 746 disease resistance and defense response genes and 85 multi-copy genes of pigeonpea. This fully genic chip has 28.94% exonic, 33.04% intronic, 27.56% 5′UTR and 10.46% 3′UTR SNPs and incorporates multiple SNPs per gene allowing gene haplotype network analysis. It was used successfully for the analysis of genetic diversity and population structure of 95 pigeonpea varieties and high resolution mapping of 11 yield related QTLs for number of branches, pod bearing length and number of seeds per pod in a biparental RIL population. As an accurate high-density genotyping tool, ‘CcSNPnks’ chip array will be useful for high resolution fingerprinting, QTL mapping and genome wide as well as gene-based association studies in pigeonpea.


Results
Design and validation of pigeonpea 62 K genic-SNP chip. The 62 K genic-SNP chip described here was designed using an in-house semi-automated pipeline (Fig. 1). For SNP discovery by re-sequencing of 45 diverse pigeonpea varieties (Supplementary Table S1), initially a reduced representation library of EcoRI restriction site-associated DNA (RAD) fragments was sequenced on Illumina HiSeq platform to produce 719,993,266 high-quality sequence reads with average read length of 125 bp, generating 107.95 Gbp of sequence data that provided about 44X sequence coverage around the EcoRI restriction sites in the genome of each variety. However, alignment of sequence reads was limited to the regions flanking the EcoRI sites, which drastically reduced the chance of finding multiple SNPs per gene required for gene haplotype analysis. Therefore, we further sequenced six pools of up to eight of these varieties with equal DNA concentrations and making sequencing libraries of randomly sheared DNA fragments to produce 1,021,072,270 high quality sequence reads with average read length of 150 bp and total 127. 36 Gbp of sequence data that provided 3.2X sequence coverage for entire genome of each variety. Details of re-sequencing data generated in this study are provided in Supplementary Table S2. The raw Illumina sequence reads of individual genotypes and six pools of 6-8 genotypes each utilized for SNP discovery have been submitted to the NCBI SRA database (BioProject No. PRJNA579901). Alignment of all sequence reads from the two data sets to 17,125 largely SC reference genes identified 645,662 high quality SNPs with an overall SNP density of 18.6 SNPs/Kbp ( Table 1). The 17,125 reference genes belonged to five different categories: (i) 10,064 SC genes unique to pigeonpea (SCP), (ii) 5899 SC genes conserved between soybean and pigeonpea (CSCSP), (iii) 192 homologs of agronomically important cloned genes in pigeonpea (AGCP), (iv) 874 disease resistance and defense response genes in pigeonpea (DRDRP), and (v) 96 multi-copy genes in pigeonpea (MCP) used as control (Table 1). Average SNP density in these genes ranged from 6.9/Kbp for the highly conserved CSCSP genes to 111.8/Kbp for the MCP genes, whereas average number of SNPs per gene ranged from 22.1 for CSCSP genes to 234.1 for MCP genes because of the size difference in the genes.
After filtering the 645,662 SNPs in 17,125 genes using Affymetrix filters for Axiom assay design 62,945 SNPs in 9648 genes were found suitable and sent to Affymetrix for chip production. Before chip production, in silico validation was done through a preliminary screening of the designed SNP assays for their p-convert values using Affymetrix power tool (APT) AxiomGTv1 algorithm in order to ensure a high-quality array 27 . For each SNP, both forward and reverse probes were assigned with p-convert values, derived from a random forest model to predict the probability of SNP conversion on the array. This model is based on factors including probe sequence, binding energy and expected degree of non-specific hybridization to multiple genomic regions. The SNP probes having high p-convert values are expected to convert on the SNP array with a high probability. After QC check 62,053 (62 K) SNPs in 9629 genes represented by 71,816 probes with p-convert values of >0.50 were included in the chip, showing very high conversion of our designed Axiom assays into the final chip. The 62 K SNP chip array named 'CcSNPnks' includes 4314 SCP, 4328 CSCSP, 156 AGCP, 746 DRDRP and 85 MCP genes (Table 1). Of the total 62,053 SNPs included in the chip, 24,537 (39.54%) were for SCP, 27,291 (43.98%) for CSCSP, 1,536 (2.47%) for AGCP, 8090 (13.03%) for DRDRP and 599 (0.96%) for MCP genes (Table 1). Overall, 9.6% of the initial 645,662 SNPs and 56.2 percent of the 17,125 reference genes were incorporated in the chip; a large proportion of the SNPs were dropped during various filtration steps of the assay design. Surprisingly, there was low 42.9% inclusion of the SCP genes in the chip as compared to 74.2-85.5% of genes from the other four categories were included.
This predominantly SC genes based chip array has 4314 SCP genes with an average of 5.7 SNPs per gene and 4328 CSCSP genes with an average of 6.3 SNPs per gene (Table 1). Further, it has 156 AGCP genes with an average of 9.9 SNPs per gene and 746 DRDRP genes with an average of 10.8 SNPs per gene. The 85 MCP genes included in chip have an average of 7.1 SNPs per gene, which was the lowest inclusion rate of only 3.6% of the average 234.1 SNPs in the MCP genes. The exclusion of SNPs from the chip was directly related to SNP density in the genes likely due to interference from the neighboring SNPs in the assay. Thus, the maximum inclusion of 28.5% SNPs was for the CSCSP genes, which have the lowest SNP density of 6.9 SNPs/Kbp. The complete array information of this 62 K genic-SNP chip 'CcSNPnks' including probe sequence, Affymetrix probes Id and NIPB SNP Id is provided in Supplementary Table S3. Despite our efforts to include multiple SNPs per gene, there were 1922 genes on the chip with single SNP and further 1517 genes with only two SNPs per gene. There was an inverse relationship between the number of SNPs per gene and the frequency of genes on the chip (Fig. 2) Genetic diversity and population structure of the pigeonpea varieties. The genotyping data from 62 K SNP chip was used to analyze genetic diversity and population structure of 95 pigeonpea varieties. All the 95 genotypes were included in the analysis because of their high SNP call rates and 100% sample success rate  However, these grouping were not strictly according to maturity groups as there were many exceptions. Baysean model based population structure of these genotypes was also analysed using a subset of only 33 genome wide unlinked SNPs with minor allele frequencies of ≥0.30 (Supplementary Tables S6, S7). The maximum delta K was reached at K = 2 and two sub-populations were inferred using Structure Harvester software (Fig. 4b). On the basis of population structure analysis 56 of the 95 varieties belonged to sub-population I and 39 varieties belonged to sub-population II (Fig. 4c). Population structure analysis further resolved these genotypes into pure and admixture types, considering accessions with Fst values of ≥0.80 as pure and those with Fst values below 0.80 as admixture type. Thus, 38 genotypes were of pure type and 18 were of admixture type in sub-population I. Similarly, 23 genotypes were of pure type and 16 were of admixture type in sub-population II. Admixture types are shown with lighter shades in Fig. 4a and Supplementary Table S7.
There was high correspondence between NJ phylogenetic tree clusters based on 30,426 SNPs and Structure sub-populations based on 33 unlinked SNPs (Fig. 4a). The NJ clusters 1 and 2a comprised of varieties largely belonging to sub-population I with exceptions of only three varieties namely PAE-9-569, T-10/9 and ICP 16451.  Table S7). This showed that Bayesian model based population structure analysis using limited number of genome wide unlinked SNP markers is efficient for grouping the varieties according to their ancestry.

Analysis of gene haplotype networks.
A unique feature of the 62 K SNP chip described here is that it incorporates multiple SNPs for 7709 of the 9631 genes with an average of 6.4 SNPs per gene, which allowed haplotype network analysis for these genes (Table 1). However, there was an inverse relationship between the frequency  (Table S3). www.nature.com/scientificreports www.nature.com/scientificreports/ of genes and number of SNPs per gene (Fig. 2). Haplotype networks were analyzed for eight selected genes in the 95 varieties, taking two genes each from SCP, CSCSP, AGCP and DRDRP categories with the highest number of SNPs in their category, annotated functions and hundred percent genotyping call rates ( Table 2).
Disease resistance gene DRDRP-234 coding for a 'serine threonine kinase' with 116 SNPs showed 14 haplotypes and the highest haplotype diversity of 0.7380 among the eight genes analysed (Fig. 5g, Table 2). It has three major haplotypes DRDRP-234-H1, DRDRP-234-H2 and DRDRP-234-H4 present in 22, 28 and 34, varieties, respectively. The second most predominant allele DRDRP-234-H1 was the ancestral haplotype giving rise to five minor alleles, two of which were roots for further evolution of eight alleles including two major haplotypes. There were three independent routes for the evolution of DRDRP-234 haplotypes. One was directly from DRDRP-234-H1, second one was from DRDRP-234-H1 through H10 to H11 to H4 giving rise to haplotypes H5, H8 and H11, while the third route was from DRDRP-234-H1 through H3 to H2 which gave rise to haplotypes H9 and H12. Disease resistance gene DRDRP-601 coding for a 'TMV resistance N-like' protein with 104 SNPs showed the highest number of 20 haplotypes and a high Hd value of 0.5158 (Fig. 5h, Table 2). The most predominant www.nature.com/scientificreports www.nature.com/scientificreports/ haplotype DRDRP-601-H4 was present in 66 varieties, haplotypes H1, H3, H6, H8, H10 and H12 were present in 2-5 varieties, while the remaining 13 haplotypes were present in one genotype each. Haplotype network analysis showed a complex evolutionary pattern for this gene with four different nodes of haplotype diversification.
Further, we examined if genotyping data on all of the large number of SNPs in a gene was needed to get the full haplotype information by taking AGCP-21 gene with 128 SNPs as an example. For this we selected three different random sets of 10 SNPs each and one random set each of 20, 40, 60, 80 and all 128 SNPs for the haplotype analysis. The number of haplotypes for the AGCP-21 gene ranged from two to four with different sets of 10 SNPs and increased successively with increasing number of SNP to maximum 12 with all the 128 SNPs ( Supplementary  Fig. 2). This shows that it is necessary to have data on as many SNPs as possible to get the full haplotype information, particularly the rare alleles.
Indentification of QTLs for yield related traits. Number of branches, pod bearing length, seeds per pod and hundred seed weight are important yield contributing traits in pigeonpea. An intraspecific RIL population derived from cross between contrasting varieties Pusa Dwarf and H2001-4 was used for mapping QTLs for some of these traits using the 62 K SNP chip. High-density linkage maps of eleven pigeonpea chromosomes were constructed with 2078 filtered SNP markers (Supplementary Fig. 3). Out of total 62,053 SNPs on the chip, 11,033 were polymorphic between the two parents. These markers were further filtered for homozygous SNP calls and chi-square test of significance for 1:1 segregation ratio to obtain 3108 markers which were subjected to map construction using JoinMap. Finally, 2078 SNP markers were unambiguopusly assigned to the eleven linkage groups of pigeonpea. The remaining markers were discarded due to significant similarity of loci. The total map length was 1100.53 cM with an average marker interval of 0.88 cM (Table 3) QTLs mapping for four important yield contributing traits, namely number of primary branches per plant (PB), pod bearing length on main axis (PBLm), pod bearing length on primary branches (PBLp) and number of seeds per pod (SP) were mapped based on phenotypic data from 66 RILs (Supplementary Table S8) evaluated in an augmented design with repetition of the two parental lines after every 10 RILs. Transgressive segregation with normal frequency distribution among the RILs was observed for all the four traits ( Supplementary Fig. 4). Eleven major QTLs, each explaining more than 10% of the phenotypic variance were identified (Table 4). Two QTLs for number of primary branches per plant (qPB3.1 and qPB5.1), each explaining 13% of the phenotypic variance were mapped on LG3 and LG5 in narrow QTL intervals of 1.34 cM and 3.04 cM, respectively. Additive effects for both the QTLs for number of primary branches were contributed by Pusa Dwarf. A major QTL for pod bearing length on the main www.nature.com/scientificreports www.nature.com/scientificreports/ axis (qPBLm3.1), contributing 12% of the phenotypic variance was identified in a 0.47 cM map interval on LG3 with additive effect contributed by the parent H2001-4. There were five QTLs for pod bearing length on the primary branches explaining phenotypic variance in the range of 13-24% and all of these except one were contributed by H2001-4 (Table 4). Only QTL qPBLp2.1 located on LG2 was contributed by Pusa Dwarf, three QTLs namely qPBLp7.1, qPBLp7.2 and qPBLp7.3 contributed by H2001-4 were mapped in a closely linked genomic region on LG7, whereas an independent QTL qPBLp10.1 also contributed by H2001-4 was mapped on LG10 (Table 4). Three major QTLs for number of seeds per pod, qSP3.1, qSP5.1 and qSP10.1, contributing 17%, 64% and 47% of the phenotypic variance were mapped on LG3, LG5 and LG10, respectively. Among these, additive effect of qSP3.1 was contributed by H2001-4 while the remaining two QTLs were contributed by Pusa Dwarf. The strongest of the three QTLs, qSP5.1 with a LOD score of 38.4 and 64% phenotypic variance explained was located in a narrow map interval of 0.33 cM on LG5 flanked by SNP markers CSCSP6821 and CSCSP15851 (Table 4). High-density linkage maps of five pigeonpea chromosomes, LG2, LG3, LG5, LG7 and LG10 developed using RIL population and showing QTL locations for above four important yield contributing traits have been depicted in Fig. 6.

Discussion
A major limitation in genetic dissection of complex traits in pigeonpea has been the unavailability of high-density genotyping assays that can be utilized effectively for the genetic improvement of the crop 22 . Hence, we used the improved second draft of the pigeonpea genome 26 , and re-sequencing data of 45 diverse pigeonpea varieties to develop a largely SC genes based 62 K Axiom genic-SNP chip for this purpose. Re-sequencing of 45 diverse  www.nature.com/scientificreports www.nature.com/scientificreports/ pigeonpea varieties identified total 645,662 SNPs in a set of 17,125 reference genes with total sequence length of 34.67 Mbp. The observed SNP density of 18.6/Kbp in these genes shows that pigeonpea has one SNP every 54 bp, which is quite high contrary to the general perception of low polymorphism based on SSR markers 2,5 . The 62 K SNP chip incorporated total 62,053 SNPs in 9629 genes with an average of 6.4 SNPs per gene. Overall 9.6% of the initial 645,662 SNPs and 56.2% of the 17,125 reference genes were included in the chip; remaining SNPs were dropped during various filtration steps of the assay design viz. (i) no interfering SNP within10 bp on either side of the target SNP (ii) sequence information of 72 bp with SNP at the 36th bp position. Average SNP density in different categories of genes ranged from 6.9/Kbp in the highly conserved CSCSP genes to 111.8/Kbp in the MCP genes, whereas average number of SNPs per gene ranged from 22.1 for CSCSP genes to 234.1 for MCP genes. Surprisingly, there was very low inclusion of only 42.9% of the SCP genes in the array as compared to 74.2-85.5% of the genes from the other four categories. Likely reason for this is high SNP density of the SCP genes (35.2 SNPs/ kbp) as compared to the CSCSP genes (6.9 SNPs /kbp) resulting in large number of interfering SNPs during assay design. The exclusion of SNPs was directly related to the SNP density because of interference from the neighboring SNPs in the assay. Therefore, the maximum inclusion of 28.5% SNPs was for the CSCSP genes, which have the lowest SNP density of 6.9 SNPs/Kbp (Table 1). Even though the SNP density was highest for the MCP genes, existence of very large number of average 234.1 SNPs per gene ensured that at least some of these were included in the assay. Pigeonpea was an orphan crop until the publication of its first draft genome 1 . Later genomic resources have been enhanced to a large extent but a quality finished reference genome is still awaited 1,26 . Among various types of DNA markers, SNPs are the most abundant and robust marker for the development of high-throughput genotyping assays for genetic studies and molecular breeding applications [28][29][30][31] . Conventional breeding needs expert breeders and a longer time frame of 12-15 years to develop a new improved variety because of low heritability of complex agronomic traits, whereas SNP chip arrays can be used to genotype thousands of markers in a short time for linkage mapping and association studies to identify DNA markers linked to these complex traits, which can be utilized for marker-assisted selection to reduce the breeding cycle by nearly half. SNP chip arrays are relatively simple to use in comparison to GBS that has high informatics cost and paucity of commonly genotyped loci across samples, therefore chip array is preferred over whole genome NGS based genotyping 32 . We need ultra high-density linkage map for genetic anchoring and a reference quality finishing of the available draft genomes pigeonpea 1,26,33 , which will be easy to develop using the 62 K SNP array. Also, we need trait-specific high resolution QTL mapping for foreground selection in marker-assisted breeding, which can be easily acomplished by the use of 62 K SNP chip. The 62 K SNP chip array designed here for Affymetrix Axiom ® platform is highly useful for genetic studies and molecular breeding applications in pigeonpea. It has very high sample success rate (99.7%), SNP genotyping call rate (99.9%) and assay reproducibility (99.9%) because of largely SC genes based assay design, as clearly seen by the lowest proportion of only 86.8% SNPs having call rates call rates of >95.5% in the MCP genes used as control. The MCP genes based SNPs showed 5% lower call rates thus clearly demonstrating the advantage of SC genes based array.
The CSCSP genes showed almost one-fourth the number of SNPs per gene than the SCP genes due to their highly conserved nature. Here we did a comprehensive analysis of all single-copy genes in the pigeonpea genome and found that they belonged to two different categories: (i) SCP genes, which are unique to pigeonpea, and (ii) CSCSP genes, which are conserved between pigeonpea and soybean. We have reported similar results earlier in case of rice, which also has a set of single-copy genes that are unique to rice . Hence, although conserved genes are largely of single-copy or low copy in nature, all single-copy genes are not conserved. Pigeonpea and soybean are estimated to have evolved from a common ancestor approximately 20-30 million years ago 33 . Therefore, these genes are an important resource for deciphering the evolutionary history of pigeonpea. Also, comparative analysis of SCP and MCP genes showed that SNPs in MCP genes have significantly lower call rates underlining the higher success rate of the SC genes based array. The MCP genes have very high SNP density with almost one SNP every ten base pairs of the gene, but >95% of these were dropped during assay design due to interfering SNPs. This is the first genic-SNP genotyping chip array for pigeonpea, the available 56 K SNP chip incorporates largely intergenic SNPs 22 . The only other available high-density genic-SNP chip is SC genes based 50 K SNP chip for rice 14 . A 44 K rice SNP array comprising 44,100 SNPs from across the 12 rice chromosomes, having both genic and non-genic regions has been reported but the average SNP call rate in this chip was <92% with 4.5% of missing data 11 . Similarly, a 50 K RiceSNP50 array developed for Illumina Infinium platform has 51,478 genome-wide SNPs with only 68% coming from genic regions 13 . The utility of present 62 K SNP array was demonstrated for analysis of population structure of 95 pigeonpea genotypes inferring two sub-populations, analysis of genetic diversity, haplotype analysis for four categories of genes on the chip, construction of high-density linkage map and identification of QTLs for yield related traits in a RIL mapping population of pigeonpea (Pusa Dwarf/H2001-4). Bayesan model based population structure analysis using a subset of 33 genome wide unlinked SNPs with 100% call rates revealed two sub-populations. Genetic diversity analysis of the same set of 95 genotypes based on 30,426 SNPs with 100% call rates showed clear correspondence between NJ phylogenetic tree and population structure. Varieties in two of the four clusters of NJ phylogenetic tree showed correspondence with sub-population I, whereas those in the remaining two clusters belonged to sub-population II with very few exceptions. The comparison of clusters obtained with 62 K SNP array clearly showed that even by using a limited number of genome wide unlinked markers, varieties can be grouped according to their ancestry in an effective manner. We have also incorporated in this chip pigeonpea homologs of 156 agronomically important cloned genes in related species, which will be useful for geneticist trying to identify novel gene haplotypes linked to quality and agronomic performance of the crop. Haplotype analysis of red pericarp (Rc1), grain size (GS3), granule bound starch synthase (GBSS1) and soluble starch synthase (SSS1, SSIIa, SSIIb, SSIIIa, SSIIIb, SSIVa, SSIVb) genes of wild and cultivated rice using a similar 50 K SNP array provided clear evidence for convergent evolution of the GBBS1 gene in wild rice and a polyphyletic orgin of the cultivated rice (Singh et al., 2017) 14,34 . Similar analysis can be conducted using 62 K SNP array in pigeonpea germplasm. High-density linkage map of pigeonpea with 2078 SNP markers developed in the present study will be useful for the anchoring of pigeonpea genome, apart from the high resolution mapping of QTLs for important yield related traits. Most of these QTLs were mapped to a narrow genetic interval of less than 1.0 cM with high LOD scores, which is needed for use in marker-assisted selection. Transgressive segregation with normal frequency distribution in the RIL population was observed for all four traits, indicating that trait enahancing genes were present in both the parents that can by pyramided for www.nature.com/scientificreports www.nature.com/scientificreports/ yield improvement. Incorporation of 746 disease resistance and defense response genes in the array with average 10.8 SNPs per gene will be useful for pathologists and breeders in identifying genes for biotic stress resistance in pigeonpea 35 .
To date more than fifteen GBS platforms and 50 SNP arrays have been developed and utilized in 25 different plant species 36 . Despite availability of a range of multiplexing platforms, like genome wide GBS, targeted GBS, high-density arrays, medium-density arrays, low level multiplexing and single marker assays (e.g. Sequenom (Agena), KASP and Taqman platforms), high throughput genotyping platforms are not routinely used for crop breeding applications partly due to cost considerations. Cost per sample is directly proportional to the level of multiplexing which in turn depends on the exact requirement of the number of markers and samples. Despite high per sample cost high-density SNP arrays and whole genome GBS platforms have much lower costs per data point as compared to targeted GBS and medium to low multiplexing platforms 36 . High-density SNP arrays are required for specific crop breeding applications, particularly high-density fingerprinting, association studies and background selection. Thus, the 56 K SNP array developed by Saxena et al. 22 and the present 62 K SNP array will be useful for pigeonpea genetic studies and breeding programme. Advantage of the 62 K SNP array is that it is a fully genic-SNP array based on single-copy genes, disease resistance genes and agronomically important genes. Another advantage is that it incorporates multiple SNPs per gene allowing scoring of haplotype (allele) information on the genes, which is necessary for allele based association mapping and selection strategies.
In conclusion the high-density 62 K genic-SNP array 'CcSNPnks' described here is a unique genotyping platform for pigeonpea. It has 62,053 SNPs from 9629 genes belonging to five different categories with more than 99% sample success rate and 99.9% SNP call rate. The usefulness of the array was demonstrated in population structure analysis, diversity analysis, phylogentic study, construction of high density linkage map, QTL mapping and gene haplotype analysis. Incorporation of multiple SNPs per gene is a unique feature of the array that allows gene haplotype analysis not possible with other available chips. Hence, it is an excellent platform for genetic and evolutionary studies as well as breeding applications in pigeonpea. It is an efficient tool for generating high-density linkage maps suitable for anchoring of the pigeonpea genome because of large number of markers genotyped across samples. In addition, the array would be useful for association mapping studies for correlating DNA polymorphism with phenotypic trait variation.  Table S1). A subset of 45 varieties from this was re-sequenced for SNP discovery. An F 8 recombinant inbred line (RIL) population derived from cross between Pusa Dwarf and H2001-4 used for QTL mapping was developed by Division of Genetics, ICAR-Indian Agricultural Research Institute, New Delhi.

Methods
Selection of genes for the Axiom SNP chip array design. CDS sequences of 56,888 protein-coding genes predicted from the second draft of the pigeonpea genome were used for extracting five categories of genes for SNP mining and chip design 26 . The SC genes were identified by BLASTN search of each gene against all the 56,888 genes and recording the top five hits. Genes showing only self-match or second match with a bit score of <200 were taken as SC genes. For identification of SCP and CSCSP genes, we downloaded all the CDS sequences of three different legume species, viz. Cicer arietinum, Glycine max and Medicago truncatula from the GIGAd band Phytozome database (https://phytozome.jgi.doe.gov/pz/portal.html). We performed BLASTN search of all the pigeonpea SC genes against a locally created database of CDS sequences of the three legumes using pre-optimized search parameters 37 . The SCP genes did not show significant homology with any of the three species whereas CSCSP genes showed homology between pigeonpea and soybean. The AGCP genes were identified by BLASTN search of a manually curated list of cloned genes for agronomic traits in major legumes and cereals obtained through literature search against the database of 56,888 pigeonpea genes at a cutoff bit score of ≥100. In addition, 874 DRDRP genes were selected from the annotations of the first draft of the pigeonpea genome 1 . Further, we included 96 MC genes as a control to make the final set of 17,125 reference genes. In order to get information on cis-regulatory elements the reference genes were extracted in full-length along with 500 bp of upstream and 100 bp of downstream sequences from the translation start and stop codons, respectively.
Re-sequencing, SNP discovery and array design. To identify SNPs in the reference set of genes we generated Illumina Hiseq pair-end sequence data for 45 pigeonpea genotypes (Supplementary Table S2), first using EcoR1 RAD sequencing of individually bar-coded samples for high depth of coverage, but subsequently random sheared DNA fragments in six pools of up to 8 varieties each for greater horizontal coverage of SNPs. We used an in-house pipeline for SNP mining and probe development for the Axiom SNP assays (Fig. 1). High quality SNPs were called by VarScan version 2.4.1 software using search parameters: min-coverage = 10x, min-reads = 2, min-avg-qual = 25 and min-var-freq = 0.05. The identified SNPs were filtered as per the Affemetrix requirements for the Axiom chip array, viz. (i) no interfering SNP within10 bp on either side of the target SNP (ii) SNP plex information of 72 bp with SNP at the 36th bp position. The SNP-plex information of selected SNPs was sent to Affymetrix Bioinformatics Services, San Diego for in-silico probe converting test, which evaluates the quality of each SNP on the basis of p-convert score using Affymetrix power tool (APT) (http://www.affymetrix.com/estore/ partners_programs/programs/developer/tools/powertools.affx). Forward and reverse probes for each SNP assay were assigned p-convert values in the range of 0.0 to 1.0. Based on the p-convert values, probes were designed for each SNP and designated as 'neutral' , 'recommended' , 'not recommended' and 'not possible' to help easy filtration Genotyping with 62 K SNP chip. Pigeonpea genomic DNA was isolated from young seedlings using CTAB method 38 , quality checked by electrophoresis in 1% agarose gel and quantified using Nano drop spectrophotometer. For target probe preparation, 20 μl of genomic DNA with concentration of 10 ng/ul was used according to Affymetrix Axiom ® 2.0 Assay Manual. The samples were amplified using Target Prep Protocol QSCB1 (P/N 702990), fragmented, hybridized on the chip followed by single-base extension through DNA ligation and signal amplification. Affymetrix GeneTitan ® Multi-Channel Instrument was used for staining, washing and scanning of the chip signals as per the manufacturer's protocol (http://media.affymetrix.com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf). SNP allele calling was done using Axiom ™ Analysis Suite version 2.0 using its three workflows i.e., Best Practices, Sample QC, Genotyping and Summary (http://media. affymetrix.com/support/downloads/manuals/axiom_analysis_suite_user_guide.pdf) on the Affymetrix Gene Titan.CEL files. The Axiom Analysis Suite requires stored library files to convert.CEL files into genotype calls. SNPs with low call rate across the samples were removed and only good quality SNPs with a DQC of >0.85 and call rates of >95% were retained for further analysis.
Analysis of genetic diversity and population structure. Genotyping data were extracted and converted to PLINK 39 .ped and .map format and imported in the TASSEL 3.2.1 software 40 . A NJ phylogenetic tree of the 95 varieties was constructed based on 30,426 SNPs with hundred percent call rates using an improved version of neighbor-joining algorithm and visualized using iTOL software 41 . From this a subset of 33 genome wide unlinked SNPs, three from each chromosome located near the telomeres and centromere and having minor allele frequency of ≥0.3, were selected for analysis of population structure of the 95 genotypes. The model-based Bayesian approach was employed to infer population structure using STRUCTURE v. 2.3.4 software 42 . The project was run with the admixture model and correlated allele frequency using a burn in period of 50,000 and 50,000 Markov chain Monte Carlo (MCMC) replications. Five independent runs were performed with each K value ranging from 1 to 10. Evanno's Δk value was obtained by processing the STRUCTURE result using STRUCTURE HARVESTER software 43 . Analysis of gene haplotype network. This analysis was performed to show the utility of 62 K SNP chip in generating gene haplotypes, which is not possible with most of the available SNP chips. For this we took two genes from each of the SCP, AGCP, CSCSP and DRDRP categories present in the chip with high number of SNPs and call rates of ≥95%. Total eight genes were selected and their SNP haplotypes were generated using TASSEL 3.2.1 software 40 in a sliding window of five nucleotides length. Haplotype networks were constructed for analysis of genealogical relationship using Network software 44 and haplotype diversity was calculated using DnaSP software version 5.10 45 . We also analyzed the effect of taking different numbers of SNPs on the haplotype patterns of a given gene with large number of SNPs. For this we selected AGCP-21 gene with 128 SNPs and took random sets of 10, 20, 40, 60, 80 and 128 SNPs for haplotype network analysis. Three sets of 10 random SNPs were analyzed to check the effect of sampling on the haplotype.

Mapping of QTLs for yield related traits.
A set of 94 F 8 RILs derived from cross Pusa Dwarf/ H2001-4 was used. For genotyping using 62 K SNP chip the DNA from 94 RILs along with the two parents were amplified, fragmented and hybridized on chip for SNP allele calling using Axiom ™ Analysis Suite version 2.0. The genotyping data was generated for 94 RILs, alleles assigned by comparing with the two parents and linkage maps were constructed using JoinMap 4.0 software 46 . Chi-square analysis was performed on the genotyping data to test the goodness of fit against the expected 1:1 segregation ratio. The locus genotyping frequencies of JoinMap were used to identify and discard markers with aberrant segregation ratio at a cutoff P value of 0.05. Linkage groups were constructed by grouping of markers at a minimum LOD threshold of 3 to a maximum of 10 with a step of 0.5. The groups were converted to maps at LOD 3 using regression algorithm with a cutoff recombination frequency of 0.20 performing a ripple after adding 2 loci. Kosambi mapping function was used for converting recombination frequency to map distance in cM.
QTL analysis was performed for four yield related traits segregating in the RILs using WinQTL Cartographer v2.5 47 using adjusted means of phenotypic data for the four traits from 66 RILs evaluated in an augment design. To find marker-trait associations, composite interval mapping (CIM) 48 was performed using the ZmapQTL standard model 6 with a window size of 10 cM and 2 cM walk speed. A 1000-permutation test was performed for estimating genome-wide LOD score threshold for QTLs (P = 0.05) 49 . The additive effect and the percentage of phenotypic variation explained by each QTL were also estimated by the CIM method.