Development and Evaluation of a High Density Genotyping ‘Axiom_Arachis’ Array with 58 K SNPs for Accelerating Genetics and Breeding in Groundnut

Single nucleotide polymorphisms (SNPs) are the most abundant DNA sequence variation in the genomes which can be used to associate genotypic variation to the phenotype. Therefore, availability of a high-density SNP array with uniform genome coverage can advance genetic studies and breeding applications. Here we report the development of a high-density SNP array ‘Axiom_Arachis’ with 58 K SNPs and its utility in groundnut genetic diversity study. In this context, from a total of 163,782 SNPs derived from DNA resequencing and RNA-sequencing of 41 groundnut accessions and wild diploid ancestors, a total of 58,233 unique and informative SNPs were selected for developing the array. In addition to cultivated groundnuts (Arachis hypogaea), fair representation was kept for other diploids (A. duranensis, A. stenosperma, A. cardenasii, A. magna and A. batizocoi). Genotyping of the groundnut ‘Reference Set’ containing 300 genotypes identified 44,424 polymorphic SNPs and genetic diversity analysis provided in-depth insights into the genetic architecture of this material. The availability of the high-density SNP array ‘Axiom_Arachis’ with 58 K SNPs will accelerate the process of high resolution trait genetics and molecular breeding in cultivated groundnut.


Results
SNP selection and array design. The analysis of the sequencing data generated from 41 genotypes (30 tetraploids and 11 diploids) against the genomes of two groundnut progenitors i.e., A. duranensis (A subgenome) and A. ipaensis (B subgenome) (Supplementary Table S1) yielded a total of 163,782 SNPs i.e., 98,375 SNPs from A subgenome and 65,407 SNPs from B subgenome (Fig. 1a). Of the 41 genotypes, sequence analysis of 30 tetraploid genotypes identified 118,860 SNPs (58,438 SNPs from A subgenome and 60,422 SNPs from B subgenome) while 11 diploid genotypes yielded 44,922 SNPs (39,937 SNPs from A subgenome and 4,985 SNPs from B subgenome). Among 30 tetraploid genotypes, analysis of WGRS data for 27 genotypes yielded 113, 835 SNPs (58,438 SNPs from A subgenome and 55,397 SNPs from B subgenome) and RNAseq data of three tetraploid genotypes yielded 5,025 SNPs from B subgenome.
All the identified 163,782 SNPs were subjected to filtering to select SNPs of good quality. The above SNP set also had 52 highly informative SNPs associated with resistance to foliar fungal diseases and oil quality. During filtering process, a total of 96,858 SNPs were discarded as 46,205 SNPs were found on both genomes, 50,642 SNPs were present on either of the two strands of DNA, and 11 SNPs were found identical. As a result, only 66,924 SNPs passed the filtering test (Fig. 1a). From this set, 825 SNPs, however, were further removed because of ambiguity and multi-allelic nature of these SNPs, leaving 66,099 good quality SNPs. From the set of 66,099 good quality SNPs, finally, 58,233 SNPs uniformly distributed across A and B subgenomes were tiled on the Axiom_Arachis array (Supplementary Table S2).
The functional annotation information was used to categorize the SNPs into different categories i.e., biological processes, molecular function, and cellular component (Fig. 1c). A majority of the SNPs found in genes were classified into cellular component followed by biological process and molecular function. SNPs underlying the genes coding for extracellular, periplasmic space proteins and involved in antioxidant activity were specifically found to be enriched in A. duranensis genome but not in A. ipaensis genome. On the other hand, genes involved in reproductive processes and riboflavin synthase complex were enriched in the A. ipaensis genome but not in the A. duranensis genome. Cell and cell part, binding and catalytic activity, and cellular and metabolic process were the most representative terms in cellular component, molecular function and biological process category.
Genome-wide distribution of selected SNPs. Selected Table S4). Genotyping of this set with the SNP array resulted in generation of genotyping data for 58,233 SNPs on 297 genotypes as QC for allele call failed for three samples. Upon identifying the polymorphic SNPs separately for the diploid species genotypes and tetraploid species genotypes, 40,714 polymorphic SNPs were identified on a panel of 36 wild genotypes while 9,312 polymorphic SNPs in the set of 264 cultivated tetraploid genotypes. Comparison of SNPs identified in the above two sets resulted in the identification of 5,625 common SNPs (Table 2). Subsequently, 44,424 polymorphic SNPs were identified through combined analysis of all the 297 genotypes of the 'Reference Set' and were used for further genetic analysis.
Of the 44,424 polymorphic SNPs, 23,559 SNPs were from A subgenome while 20,865 SNPs from B subgenome and achieved an average of 2,221 polymorphic SNPs per pseudomolecule (Fig. 2f, Table 2). An average 2,356 polymorphic SNPs per pseudomolecule were from A subgenome and the number of SNPs ranged from 1,822  Table S5). Minor allele frequency ranged from zero (1,814 SNPs) to 0.50 (52 SNPs) with an estimated average of 0.08. Similarly, heterozygosity in the population ranged from zero (7,842 SNPs) to 0.87 (AX-147231295) with estimated average of 0.02. Polymorphic information content (PIC) value for SNPs on the array ranged from 0.01 (5,420 SNPs) to 0.50 (608 SNPs) with an estimated average of 0.13 (Supplementary Table S5) in the population.
Genetic analysis of the 'Reference Set'. Generation of high throughput SNP genotyping data on the 'Reference Set' provided an opportunity to gain deeper insights into the genetic relatedness among the genotypes and also the genetic architecture of this important germplasm set (Fig. 4a). The genetic diversity analysis with 44,424 polymorphic SNPs identified four clusters (Cluster-I, Cluster-II, Cluster-III and Cluster-IV) (Fig. 4b, Supplementary Table S6). The Cluster-I consisted of genotypes of the diploid (wild) species while Cluster-II, Cluster-III and Cluster-IV consisted of tetraploid genotypes (A. hypogaea). Among the tetraploid groups, the Cluster-II had genotypes form hypogaea subspecies, Cluster-III had both the subspecies i.e., hypogaea and  fastigiata while Cluster-IV had genotypes from fastigiata subspecies. The diverse lines with maximum genetic distance within cultivated groups can be used for developing genetic and breeding populations for both mapping traits as well as for developing improved varieties with desirable agronomic traits and enhanced genetic base. This array has also shown significant loss of diversity in the cultivated gene pool and preferential selection of genomic regions in these subspecies (Fig. 4c). The loss of genetic diversity is clearly visible in cultivated genotypes i.e., three clusters (Cluster-II, III and IV) upon comparison with wild species accessions grouped together in Cluster-I. Most importantly, few genomic regions were found conserved and were specific to subspecies fastigiata and hypogaea of the A. hypogaea (cultivated tetraploid). For example, the genomic regions conserved to subspecies fastigiata were observed on pseudomolecules A02, A06, A07 and A10 of A subgenome while B01, B07 and B09 of B subgenome. Similarly, the genomic regions conserved to subspecies hypogaea were observed in pseudomolecules A04 of A subgenome while B04, B08 and B10 of B subgenome.   Table S7). Of the total identified 809 high frequency SNPs, 94 SNPs were from subspecies fastigiata and 517 SNPs from subspecies hypogaea. No common SNP was detected between subspecies fastigiata and wild accessions, while, 198 SNPs were found common between subspecies hypogaea and wild accessions. The prediction of effect for these 198 SNPs indicated their location in 133 genes having missense or nonsense mutations. The functions of these genes and their association in various biological functions has been described in the Supplementary Table S8. The enrichment analysis using GO ids showed that majority of genes have binding, catalytic and transporter functions and are involved in catalytic and cellular processes (Supplementary Figure S1).  Discussion High density genotyping 'Axiom_Arachis' array. A high density SNP genotyping array with uniform genome coverage is must in any crop for conducting high resolution trait mapping 2,19 . Development of SNP array for high throughput genotyping was very much required in the case of groundnut due to its large genome size and low genetic diversity in the cultivated gene pool 2,16,19 . Availability of Axiom_Arachis array with 58,233 SNPs to the Arachis community provides an opportunity to generate high throughput genotyping data on different types of genetic and breeding populations for accelerating genetic diversity, high resolution trait mapping and breeding applications. Similar arrays were developed recently in other crop species such as rice (44 K by McCouch et al. 5 ; 50 K by Chen et al. 6 ; 50 K by Singh et al. 20 ), sunflower (11 K SNPs by Bachlava et al. 9 ), soybean (50 K SNPs by Song et al. 10 ), oil palm (171 K SNPs by Kwong et al. 21 ), maize (58 K SNPs by Ganal et al. 7 ) and wheat (90 K SNPs by Wang et al. 13 ). Much higher density genotyping arrays are available in animal species like chicken (600 K SNPs by Kranis et al. 22 ), cattle (648 K by Rincon et al. 23 ) and human (900 K SNPs by Kathiresan et al. 24 ). In the case of plants, 819 K SNPs arrays for wheat 14 and 600 K SNPs arrays for maize 8 are the most-dense publicly available genotyping arrays.

Fair representation of genome and Arachis species. The genome size of A and B subgenomes is
reported to be 1,070 Mb and 1,360 Mb, respectively 17 . Axiom_Arachis array developed in this study has fair genome representation i.e., 51.5% from A subgenome and 48.5% from B subgenome with an average 2,998 and 2,825 SNPs per pseudomolecule for A and B subgenome, respectively. Also this array achieved high density genome coverage of 1 SNP per 42 Kb in tetraploid genome while 1 SNP per 36 Kb in A subgenome and 1 SNP per 48 Kb in B subgenome. The above density is comparable to other recently developed SNP arrays in maize 7 , rice 6 and oil palm 21 .

Insights on genetic diversity and genetic relationship. The newly developed Axiom_Arachis
SNP array was deployed to study genetic diversity and genetic relatedness in the 'Reference Set' developed by ICRISAT 25 . This set has 300 individuals of which 264 are cultivated tetraploid while 36 are wild accessions. Out of 36 wild accessions, 34 were diploid while two accessions were tetraploid belonging to A. monticola. High quality SNP genotyping data was generated successfully for all except three genotypes of the panel mainly due to poor QC for allele call. The polymorphism rate was four times higher in the smaller set of wild genotypes than the larger set of cultivated genotypes with mere ~10% common SNPs between both the sets.
After removing the common SNPs, 77.6% SNPs showed polymorphism in the 'Reference Set' with comparatively higher rate of polymorphism in A subgenome (53.0%) than the B subgenome (47% Mean major allele frequency, minor allele frequency, heterozygosity and PIC was found to be 0.92, 0.08, 0.02 and 0.13, respectively in the 'Reference Set' . Phylogenetic analysis clearly grouped all the genotypes of the 'Reference Set' in four groups. As expected, the wild genotypes were grouped together in Cluster-I while cultivated genotypes (A. hypogaea) were clustered into three distinct groups. Majority of the genotypes from hypogaea subspecies were grouped together in Cluster-II, both the subspecies i.e., hypogaea and fastigiata genotypes in Cluster-III and genotypes of fastigiata subspecies in Cluster-IV. The grouping of different subspecies genotypes using the SNP array was found much better than the earlier studies conducted with SSR and DArT genotyping on the same germplasm set 26,27 . Conserved genomic regions harboring domestication related genes. The cultivated groundnut crop across the world can be divided into four market types from two subspecies. The subspecies A. hypogaea hypogaea do not flower on main stem, have alternate branching patterns, mature later and produce large seeds while A. hypogaea fastigiata produce flowers on the main stem, have sequential branching patterns, mature earlier and produce smaller seeds 28 . This array has demonstrated immense power not only in grouping the genotypes of different subspecies of A. hypogaea but also showing preferential selection of genomic regions in these subspecies. Significant loss of diversity can be clearly observed in all three clusters representing cultivated tetraploid genotypes as compared to the cluster representing wild species accessions. This study indicated that during the evolution of subspecies fastigiata and hypogaea of the A. hypogaea (cultivated tetraploid), selective genomic regions remained conserved. These genomic regions might be harboring genes that are responsible for maintaining subspecies specific features.
Scientific RepoRts | 7:40577 | DOI: 10.1038/srep40577 In order to get further insights, we identified the pseudomolecule-wise subspecies specific high frequency SNPs for fastigiata and hypogaea. Prediction of 198 common SNPs between subspecies hypogaea and wild accessions indicated their location in 133 genes including plant defense against biotic and abiotic stresses, cellular growth and development, seed and pollen development. More importantly genes related to domestication traits such as skotomorphogenesis, flowering time (Aradu.1A8NN.1), seed maturity and germination (Aradu.PZ509.1), lateral root development (Aradu.895HT.1), stem elongation (Aradu.Y6SZD.1) and self-incompatibility (Araip.D2CP3.1) have shown genomic variation between these two subspecies. Such variation for domestication related traits might be playing an important role in retaining the basic features of these two subspecies during the course of evolution.
In summary, the present study reports development of Axiom_Arachis array with 58 K informative SNPs and its successful deployment in understanding the genetic diversity of ICRISAT 'Reference Set' in groundnut. This array is an important genomic resource for the Arachis and especially the groundnut community that will be useful not only for accelerating genetics and breeding applications but also to understand evolutionary biology in Arachis species. DNA/RNA isolation, sequencing and SNP identification. High quality DNA isolation using modified CTAB-based method followed by quantification and quality check of DNA was done as mentioned in Mace et al. 30 . The WGRS data were generated for 21 tetraploid accessions and 4 diploids at UGA, USA; and 6 tetraploids (TAG 24, GPBD 4 and 4 tetraploid pooled samples for foliar disease resistance) at ICRISAT, India. The RNA-seq data for 3 tetraploid genotypes (ICGV 91114, JL 24 and J 11) were also generated at ICRISAT, India. The WGRS data for 7 diploids (PI 475845, ICG 8138, ICG 8960, ICG 8209, ICG 13160, ICG 8206 and ICG 8123) were generated at Macrogen Inc., South Korea.

Materials and Methods
The raw sequences obtained were filtered using various softwares to get high quality reads for downstream processing. Briefly, the adapter sequences were trimmed using Cutadapt v1.2.1 31 while quality trimming was carried out using TrimGalore v0.3.7 software. Such high quality sequences were mapped against the two diploid genomes (A and B subgenomes represented by A. duranensis and A. ipaensis) 17 with Bowtie2 32 and SNPs were identified. Further, the homeologous SNPs were removed using SWEEP Prime version program 33 . Subsequently, SNPs which were present within 10 kb of A. duranensis and A. ipaensis annotated genes 17 were used for downstream processing. The 35 bp sequences flanking to both side of selected SNPs were extracted using custom script and searched against A and B subgenomes for uniqueness using BLASTN program. Finally, the SNPs showing unique hit of ≥ 94% identity or across at least 60 aligned bases were selected for array development.
Array design using selected SNPs. The selected SNPs representing 10 pseudomolecules each for A subgenome and B subgenome following the above mentioned criteria were subjected to in silico validation. The in silico validation of the assay involved preliminary screening of the designed array file for each selected SNP, including their p-convert values generated using Affymetrix power tool (APT) AxiomGTv1 algorithm to ensure a high-quality final array (http://www.affymetrix.com/estore/partners_programs/programs/developer/tools/powertools.affx). Both forward and reverse probes of each SNP were assigned with p-convert values, derived from a random forest model to predict the probability of SNP conversion on the array. The model considers factors including the probe sequence, binding energy, and expected degree of non-specific hybridization to multiple genomic regions. SNP probes with high p-convert values are expected to convert on the SNP array with a high probability. Potential probes were designed for each SNP in both the forward and reverse direction, each of which was designated as 'recommended' , 'neutral' , or 'not recommended' based on p-convert values through which the SNP data sets were easily filtered. A SNP marker/strand is recommended if: p-convert > 0.6, no wobbles, and poly count = 0. In other words a SNP marker/strand was not recommended if they had duplicate count > 0 or poly count > 0 or p-convert < 0.4 or wobble distance < 21, or wobble count > = 3. Therefore, if a marker has the same recommendation for each strand, then it was tiled on strand with the highest p-convert value. None of the [A/T] or [C/G] markers were selected as they take up twice as many features. Finally, probes for selected SNPs were designed and successfully synthesized on the array chip.
Genotyping with the SNP array. Affymetrix GeneTitan ® platform was used to genotype "Reference Set" with the SNP array. Initially the target probes were prepared using each DNA sample having minimum quantity of 20 μ L of good quality DNA and 10 ng/μ L concentration. This procedure is explained in detail in Affymetrix Axiom ® 2.0 Assay Manual. These samples were then amplified, fragmented and hybridized on chip followed by single-base extension through DNA ligation and signal amplification. This procedure is explained in detail in Scientific RepoRts | 7:40577 | DOI: 10.1038/srep40577 Affymetrix Axiom ® 2.0 Assay Manual Target Prep Protocol QRC. The GeneTitan ® Multi-Channel Instrument was then used for staining and scanning the samples and the details are provided in (http://media.affymetrix. com/support/downloads/manuals/axiom_2_assay_auto_workflow_user_guide.pdf).
SNP allele calling and data analysis. Allele calling was done using Axiom ™ Analysis Suite version 1.0 using its three workflows i.e., Best Practices, Sample QC, Genotyping and Summary Only (http://media.affymetrix. com/support/downloads/manuals/axiom_analysis_suite_user_guide.pdf). We used 'Best Practices' workflow to perform quality control (QC) analysis of samples to select only those samples which passed the QC test for further analysis. The 'Sample QC' workflow was then used to produce genotype calls for the samples which passed QC analysis using 'Best Practices Workflow' . The 'Genotyping' workflow was used to perform genotyping on the imported CEL files regardless of the sample QC matrix. Before making the genotyping calls, samples not passing the QC were removed as their inclusion may reduce the quality of the analyzed results. Finally the 'Summary Only' workflow was used to produce a summary containing details on the intensities for the probe sets for use in copy number analysis tools. It also allows to export the SNP data after the analysis is completed for downstream analysis. We have analyzed the diploid and tetraploid genotypes separately keeping the DQC > 0.75 and call rates > 90. The above criteria helped in removing the SNPs having low call rates and keeping only the high quality SNPs for the further analysis. NR database was used to annotate all SNP containing genes using BLASTx program (http://blast.ncbi.nlm.nih.gov/blast/Blast.cgi?PROGRAM= blastx&PAGE_TYPE= BlastSearch&LINK_LOC= blasthome) with cut off E value < 1.0E-5. Further, these SNPs were annotated and their effect on gene function was predicted using SNPEff V4.2 34 software. For this, a file containing reference genome sequence in FASTA format and general feature format (GFF) file containing co-ordinates of various gene features such as, coding sequence (CDS), 5′ untranslated region (5′ UTR), 3′ untranslated region (3′ UTR), etc. were downloaded from the PeanutBase 35 and used to build genome database. The annotation of gene models identified from peanut genome has been described in Bertioli et al. 17 .
Diversity analysis. Called allelic data was used for studying genetic diversity and genetic relationship among individuals of the "Reference Set". The polymorphic information content (PIC), major allele frequency, number of observations, availability and gene diversity were calculated using the software PowerMarker ver. 3.25 36 .
Identification of subspecies specific high frequency SNPs. In order to identify subspecies specific SNPs, the genotypes from the 'Reference Set' were divided into three categories viz. fastigiata, hypogaea and wild accessions. The allele frequency at each SNP site within each category was calculated. Further, the allele frequencies at each SNP position were compared between three categories and SNPs with contrasting allele frequencies in all the three categories were identified. Additionally, the annotations of these SNPs and their effect on genes were predicted using SNPEff V4.2 34 and SNPs with moderate (missense mutations) and high (nonsense mutations) effect were identified. Further, the protein coding sequences of genes containing missense and nonsense SNPs were extracted and the functional annotation and gene ontology analysis was carried out using BlastGO 37 software to determine their involvement in particular biological function.