Development of a 690 K SNP array in catfish and its application for genetic mapping and validation of the reference genome sequence

Single nucleotide polymorphisms (SNPs) are capable of providing the highest level of genome coverage for genomic and genetic analysis because of their abundance and relatively even distribution in the genome. Such a capacity, however, cannot be achieved without an efficient genotyping platform such as SNP arrays. In this work, we developed a high-density SNP array with 690,662 unique SNPs (herein 690 K array) that were relatively evenly distributed across the entire genome, and covered 98.6% of the reference genome sequence. Here we also report linkage mapping using the 690 K array, which allowed mapping of over 250,000 SNPs on the linkage map, the highest marker density among all the constructed linkage maps. These markers were mapped to 29 linkage groups (LGs) with 30,591 unique marker positions. This linkage map anchored 1,602 scaffolds of the reference genome sequence to LGs, accounting for over 97% of the total genome assembly. A total of 1,007 previously unmapped scaffolds were placed to LGs, allowing validation and in few instances correction of the reference genome sequence assembly. This linkage map should serve as a valuable resource for various genetic and genomic analyses, especially for GWAS and QTL mapping for genes associated with economically important traits.


Results
SNPs identification and selection. SNPs were identified from a total of over 6.4 billion sequencing reads from 1,213 catfish (Table 1). Initially, a total of over 9.6 million putative SNPs were identified, including 8.6 million SNPs from channel catfish and 3.8 million SNPs from blue catfish. To select high-quality SNPs, 70 bp flanking sequences (35 bp upstream and downstream of the SNP site) of all the putative SNPs were mapped onto the catfish reference genome sequence and matched more than one locus were removed, resulting in approximately 5.7 million uniquely mapped SNPs from channel catfish and 3.1 million uniquely mapped SNPs from blue catfish. A total of ~1.8 million SNPs from channel catfish and ~300,000 SNPs from blue catfish were retained after additional steps of screening, including elimination of SNPs surrounded by simple sequence repeats, removal of tri-allelic SNPs, and removal of markers with low probability of conversion based on Affymetrix algorithms ( Table 2).
SNPs included on the 690 K array. The final SNPs included on the 690 K SNP array were summarized in Table 3. A total of 690,662 SNPs were selected, including 238,484 genic SNPs and 452,178 intergenic SNPs. The 238,484 genic SNPs were from 24,186 genes annotated from the channel catfish reference genome sequence, or from the transcripts assembled from RNA-Seq datasets. For most SNPs with unique flanking sequences, only one probe was designed on the array. However, two probes were designed on the array for 2,905 SNPs whose flanking sequences were less unique, leading to a total of 693,567 SNP probes on the 690 K SNP array.
The 690 K catfish SNP array also included species and strain specific SNPs (Table 3), which should be useful for genetic analysis of the interspecific hybrid system and intraspecific crossbreeding. Of the total SNPs on the array, 581,002 SNPs were specifically identified from channel catfish, 44,694 were exclusively observed in blue catfish, 19,124 were interspecific SNPs which were homozygous within but differing between species, and 45,842 SNPs that were heterozygous within and between species.
For strain identification, 48,434 SNPs from four catfish aquaculture strains that originate from different geographic locations were also included on the array. These four strains possess different production traits such as growth rates, disease resistance, and adaptation to environmental stresses. A total of 6,622 SNPs were specifically from a wild channel catfish population of the Coosa River, Alabama, which may be useful for genetic analysis of genomic regions with selective signatures. In addition to SNP probes, 2,000 data quality control (QC) probes were also included on the SNP array serving as negative controls. The SNPs included on the 690 K SNP array were of high quality, with an average p-convert value of 0.71; 97% of SNPs had a p-convert value of greater than 0.65 (Fig. 1A). A proportion of 89.6% of SNPs had a MAF greater than 0.1 (Fig. 1B). This distribution of MAF should provide the SNPs with a relatively high level of polymorphic content, a desired characteristic for linkage mapping, GWAS, and other genetic analysis.
The spacing among SNPs was evaluated using their physical position on the reference genome or on reference sequence contigs that were not mapped to chromosomes. As shown in Fig. 1C, over 86% SNPs had an inter-SNP spacing of less than 2,000 bp.
Distribution of SNPs across the genome. One of the most important goals of the SNP array development is to have a good coverage of the entire genome with relatively even distribution of SNPs. The SNP locations on the reference genome sequence were used as the coordinates to assess their overall distribution. A total of 659,912 (95.5% of all the SNPs on the array) SNPs were developed from the reference genome sequence, which span a total of 778 Mb, approximately 99.4% of the assembled reference genome sequence. As shown in Fig. 2, almost all of the genomic regions were covered by SNPs, except a few highly repetitive regions from which convertible SNP probes could not be designed. The remaining 30,750 (4.5%) SNPs were developed from other genomic resources, including 15,108 SNPs from unmapped scaffolds and contigs, 521 SNPs from transcript sequences via de novo transcriptome assembly, and 15,121 SNPs from bacterial artificial chromosome (BAC) end sequences (BES). Taken together, all the SNPs on the array covered 98.6% of the sequences in the reference genome scaffolds and 93.9% of the BAC based physical map contigs.
Performance of the catfish 690 K SNP array. Performance of the SNP array was examined by genotyping catfish DNA samples from hybrid backcross families and channel catfish domesticated families. As summarized in Table 4

Construction of channel catfish linkage map.
A total of 478 channel catfish samples of four mapping families were used for linkage mapping. After applying the criteria of DQC score greater than 0.82 and call rate greater than 97%, 5 samples with poor qualities were eliminated. Genotyping data of the remaining 473 samples were imported into Plink for a pedigree information test (Fig. 3). The vast majority of samples fell into three clusters, with two families gathered together because they were generated from one sire. Eight individual outliers were identified and excluded from linkage analysis. The genotyping data of remaining 465 samples were imported into Lep-map2 for SNP filtering prior to linkage group assignment. According to the assessment of genotyping quality and polymorphism in all samples from the four reference families, a total of 287,583 SNPs were informative in at least two families. A total of 287,370 qualified SNPs were successfully assigned into 29 linkage groups, which was in concordance with the number of chromosomes of the catfish haploid genome. A two-round marker ordering procedure were carried out with the four families simultaneously. The first round of marker ordering identified 116,864 representative markers. By using a hidden Markov model (HMM), the OrderMarkers module modeled recombinant haplotypes and identify duplicated markers. After filtering these duplicated markers, a second round of marker ordering was performed to improve the order of representative markers. Finally, the previously excluded duplicate and stacked markers were inserted back into the maker order to calculate the genetic distance. A total of 253,087 markers were placed onto the linkage map.
The sex-average genetic distances were calculated by taking account the recombination probabilities in both sexes. As summarized in Table 5, the sex-average map consists of 253,087 markers including 30,591 unique positions, with a total genetic length of 3,004.7 cM. The marker intervals estimated based on the unique marker positions ranged from 0.08 cM/marker pair in LG12 and LG13 to 0.13 cM/marker pair in LG22, with an average marker interval of 0.1 cM/marker pair in sex-average genetic map. As illustrated in Fig. 4, there were no abnormal large gaps on the genetic map. The detailed information on marker position is provided in Supplemental Table S1.
The sex-specific genetic distances were calculated by taking account the recombination rates in only one sex. The female genetic map consisted of 23,610 unique positions, with a total genetic length of 3,582.3 cM (Table 6   unique marker positions ranged from 0.11 cM/marker in LG12 to 0.17 cM/marker in LG22, with an average marker interval of 0.14 cM/marker on male genetic map. The difference between female and male linkage maps was assessed according to contingency G-test 67 . Significantly higher recombination rates were observed in the female genetic map than in the male genetic map,  for the majority of the linkage groups (p < 0.01). The female genetic map was 1,036.7 cM longer than the male genetic map, with an average female-to-male ratio of 1.4:1. The ratio varied by linkage groups, ranging from 0.96 in LG6 to 2.06 in LG18 (Table 6). Large differences of recombination rate were also observed in LG2 and LG23 with the ratios of female to male greater than 1.6.
Integration and validation with reference genome sequence. The genetic map anchored 1,602 scaffolds of the reference genome sequence, corresponding to 766 Mb (97.8%) of the total 783 Mb. Comparing with the results from using the catfish 250 K SNP array, 1,007 previously unmapped scaffolds are anchored to chromosomes. However, these scaffolds are small in size, resulting in anchoring of only additional 5.7 Mb of the reference genome sequences to chromosomes. Additionally, 15 transcripts generated from de novo transcriptome assembly and 244 previously unmapped BAC-based physical contigs are also placed onto the linkage map. The positions of previously unmapped markers are illustrated in Fig. 5, making the whole genome reference sequence more ordered, continuous and connected.
Mapping of over 250,000 SNPs across the genome validate the accuracy of the reference genome sequence assembly. As shown in Fig. 5, the marker orders from the linkage map and the reference genome are in concordance. A few differences of the marker orders between the linkage map and the reference genome sequences are observed in scf00172 (LG8), scf00249 (LG11), scf00274 (LG12), scf00439 and scf00436 (LG22) (Fig. 6), suggesting that these regions may be mis-assembled.

Discussion
In this study, we developed the catfish 690 K SNP array using Affymetrix Axiom technology. Compared to the catfish 250 K SNP array 30 , the 690 K array has significantly more markers, more even genomic distribution, higher qualities of SNPs based on p-convert values, greater level of polymorphic contents, and greater level of coverage for the whole genome. The final set SNPs on the array consisted of channel catfish specific, blue catfish specific, and inter-species SNPs. A subset of strain-specific markers from domestic and wild channel catfish populations were also included on the array, which should be useful for detecting genomic regions with selective signatures. The 690 K SNP covered 24,186 of the 27,618 predicted genes in catfish, including 23,876 genes identified from the reference genome sequences and 310 genes predicted from RNA-seq datasets. The remaining genes were not covered because most of these genes have more than one copy on the genome; with duplicated genes, it is difficult to design reliable SNP probes. The selected SNPs were relatively evenly distributed on the reference genome sequence, which should benefit the detection of linkage disequilibrium in future genome-wide association studies and fine-scale QTL mapping. Distribution of MAF indicated that most of the SNPs included on the array were common variants (89.6% of SNPs had a MAF > 0.1).
The evaluation of SNP array performance by genotyping samples from catfish backcross hybrid families and channel catfish families indicated that 72.7% of the SNPs were polymorphic in backcross hybrid catfish and 67.5% were polymorphic in channel catfish. Despite more channel catfish samples processed, higher percentages of SNPs were converted and polymorphic in backcross hybrids samples. This may be caused by the interspecific and blue-specific SNPs included on the array. Alleles from blue catfish could be detected as they segregated in the hybrid backcross genomes.
Taking advantage of the catfish 690 K SNP array, we constructed a genetic linkage map with super high density of markers for channel catfish. Not only the total number of mapped markers were increased from 50,000 to 250,000, the resolution of the map was also increased by using four reference families with a total of 465 individuals. However, because of the large number of markers and the limited recombination along the chromosomes, a large number of stacked markers were still observed. The only way to increase resolution power is to increase the sample sizes of the reference population (numbers of individuals and number of the reference families). While that can be done, the primary limitation is the cost. Due to the presence of genotyping errors and missing values, genetic mapping with these closely linked markers greatly increase the computation burden and usually introduce an overestimated genetic sizes. In our study, the total genetic distance and error estimate were dramatically reduced when we filtered the stacked SNPs. Genotyping errors in duplicate markers could also deform the marker orders, which can cause more severe problems compared with the inflation of genetic distances. To reduce the effect of clustered markers on the map construction and reduce the computing burden, we selected one representative marker to anchor the clusters. After the marker order was settled, the duplicated markers were then rejoined into the linkage group. This procedure rescued the informative markers onto the linkage maps based on the positions of their representative anchor marker. Such clustered markers were still valuable to organize the scaffolds to chromosomes using the reference genome sequences.
By integration of the genetic map with the reference genome sequences, the patterns of recombination across the whole chromosomes could be determined. As shown in Fig. 5, mild to strong localized specific recombination patterns were observed in each linkage group. Most often, the recombination rates were usually elevated towards the ends and decreased in the middle of the chromosome. Stacked markers that located in regions of strong linkage disequilibrium were observed in each linkage group, especially in regions close to the centromeres. The observed overall recombination patterns were in concordance with previous research in catfish 26,28 and other aquaculture species such as tilapia 68 , medaka 69 , Atlantic salmon 70 , and rainbow trout 71 . Although the ultimate mechanism is still elusive, increasing evidence supports the hypothesis that multiple functional sequence motifs   Table 6. Summary of the sex-specific linkage map of channel catfish. are involved in recombination regulation 72 . With the available of channel catfish reference genome, the next step of our study is to identify the sequence features within the recombination "hot zones". Higher recombination rate is observed in females than in males, which is consistent with previous studies in channel catfish. This phenomenon has also been observed in species with dissimilar sex chromosomes, with larger recombination rate in homogametic sex than in heterogametic sex. Some examples of this include mice 73 , zebrafish 74 , Atlantic salmon 35 , rainbow trout 71 , European seabass 75 , silver carp 76 , and grass carp 77 . One hypothesis toward sexual dimorphism in recombination fraction is that sex chromosomes in the homogametic sex are equal in size, therefore, recombination is more likely to occur. However, this does not seem to be true for channel catfish because the X and Y chromosomes are cytologically indistinguishable 78 . Interestingly, our results showed that recombination rate of LG4, which corresponds to the sex chromosome, is similar to that of other linkage groups, perhaps with the exception of the sex determination region. This suggests that other factors such as male-specific selection 79 or chromatin differences may account for this difference 80 .
The high-density linkage map allowed validation and correction of the reference genome. In LG8, a 1.2 Mb stretch (from 9,958,308 bp to 11,156,616 bp) of scf00172 (NCBI accession KV452904.1) is supposed to be placed reversely between scf00174 and scf00637. In LG11, a 2.5 Mb subsequence (from 5,088,016 bp to 7,539,243 bp) of scf00249 (NCBI accession KV452863.1) is assumed to be assembled in the opposite direction. In LG12, a 1.8 Mb subsequence (from 3,480,035 bp to 5,255,031 bp) of scf00274 (NCBI accession KV453011.1) is presumed to be assembled in the opposite direction. It is also surmised that the whole sequence of scf00439 (NCBI accession KV453114.1) should be assembled reversely in LG22. Scf00436 (NCBI accession KV453115.1) is supposed to be placed between scf00437 (NCBI accession KV453116.1) and scf00435 (NCBI accession KV453121.1). By integrating the linkage map with the reference genome assembly, over 5.6 Mb from 1,007 previously unmapped scaffolds were successful anchored to their corresponding positions. Additionally, 15 transcripts generated from de novo transcriptome assembly, and 244 previously unmapped BAC-based physical contigs were also placed onto the linkage map. This is a significant improvement on integration of the linkage map and the reference genome sequence, which is useful for further genomic studies, QTL analysis, and whole genome-based selection.

Materials and Methods
Ethics statement. This study was approved by the Institutional Animal Care and Use Committee (IACUC) at Auburn University. All experiments involving the handling and treatment of fish were carried out in accordance with approved guidelines.

SNP identification and SNP array development.
In order to identify SNPs from the whole genome, Illumina sequencing data from various studies involving fish with diverse genetic background were collected. For channel catfish, a total of 3.3 billion reads from RNA-seq of 824 fish and 2.4 billion reads from whole genome sequencing of 150 fish were collected for SNP identification, with an average genome coverage of 500X (Table 1). For blue catfish, data used for SNP identification included over 359 million reads of GBS data from 190 individuals and 478 million reads of RNA-seq data from 49 individuals.
To reduce sequencing artifacts and improve the SNP quality, raw sequencing reads from all the studies were first subjected to quality control with Trimmomatic (version 0.33). Adaptor sequences, ambiguous nucleotides (N's), extreme short reads (< 25 bp) were removed. Low quality bases were identified and trimmed with a sliding window method, bases within a window size of 4 were cut once the average quality was less than 15. For reads generated from GBS, chimeric sequences were eliminated by trimming the sequence at the corresponding restriction enzyme site. The clean reads generated by WGS and GBS were then aligned to the reference genome sequence with BWA-MEM (version 0.7.12). To acquire high sensitivity and accuracy for RNA reads alignment, a 2-pass alignment method was performed using STAR aligner (version 2.4.0j). The results of the alignment were exported in BAM format for subsequent analysis 81 .
Prior to SNP identification, the BAM files were processed with Picard tools (version 1.119) to identify and remove redundant copies of duplicates. BAM files of WGS reads were subjected to local realignment of regions near INDELs with GATK (version 3.3) to improve the accuracy of variant calling. For BAM files of RNA-seq reads, "SplitNCigarReads" commands of GATK were used to split reads which spanned multiple exons and to trim overhangs. Ambiguously mapped reads were removed using SAMtools (version 0. 1.19) under the criteria of a minimum MAPQ score of 20. For paired-end reads, only those mapped in a proper pair were kept for further analysis. Subsequently all alignment files were integrated for variant calling using Varscan (version 2.3.7). The putative SNPs were identified with the thresholds of minor allele frequency greater than 0.05, minimum read base quality of 20, strand-filter of 90%, and minimum read depth of 10. Sequences of 71-bp spanning each SNP were extracted, with 35-bp upstream and 35-bp downstream of the SNP base, respectively. To avoid false positive SNPs caused by ambiguous mapping of duplicated regions, the 71-bp fragments were aligned to the reference genome sequence with BLAST + (version 2.2.29). SNPs with flanking regions that mapped to multiple sites or low complexity and repetitive regions on the reference genome sequence were removed.
SNP selection was performed in multiple steps using different criteria regarding different SNP types. All the filtration parameters were set with the aims of achieving an evenly spaced coverage across the entire genome and removal of false positive sites. All the original SNPs were classified into different groups and selected in a certain order: SNPs within genes were first selected, then SNPs from intergenic regions were added, finally, species-specific SNPs and strain-specific SNPs were included in the pool of candidate SNPs. Custom scripts were used to perform the selection according to the following criteria: (1) To avoid non-specific hybridization, the flanking sequences of selected SNPs should not have other SNPs or simple nucleotide repeats; (2) For practical application in SNP genotyping assays, only bi-allelic SNPs were selected; (3) To acquire high-polymorphic rate, SNPs with a minor allele frequency greater than 0.1 were preferentially selected; (4) To acquire a high-capacity, A/T or C/G SNPs were not selected unless absolutely needed; (5) To minimize the effects of GC content on the signal intensity, the GC percentage of the SNPs flanking sequences should be between 30-70%. For SNPs within genes, once a new SNP was included into the candidate pool, SNPs from the flanking regions of 200 bp were not added. For SNPs from intergenic regions, once a new SNP was included into the candidate pool, SNPs from the flanking regions of 350 bp were not added. SNPs from unmapped scaffolds and contigs apart from reference genome sequences were also included regardless of their distances.
All catfish SNPs that passed this filter were submitted to Affymetrix Bioinformatics Services for in-silico probe converting test, where the performing quality of the SNPs were evaluated. Upstream and downstream probes flanking the SNPs were assigned with a p-convert value (0.0 to 1.0), respectively. Probes with high p-convert values were more likely to be successfully genotyped. A threshold for p-convert value was set to remove the lowest performing probes to facilitate selection of a high-quality SNP list. Markers with at least one probe that passed the p-convert value threshold were retained. For SNP markers with both of the probes that passed the p-convert value threshold, the probe with greater p-convert value was selected. For SNPs with only probes of low p-convert values, both probes were included to cover the genome region.
In addition to the polymorphic SNPs, 2,000 probes generated from non-polymorphic genomic regions were also introduced as QC probes of which 1,000 probes were selected with A or T at the 31st base, and 1,000 QC probes were selected with G or C at the 31th base. The QC probes along with the final list of SNPs were submitted to Affymetrix for fabrication of Axiom GW genotyping array. SNP array performance evaluation. A total of 480 catfish were genotyped to assess the performance of SNP array, including 396 purebred channel catfish of the Delta Select line provided by USDA-ARS Warmwater Aquaculture Research Unit, and 84 hybrids generated by backcrossing channel x blue interspecific hybrid females with male channel catfish.
DNA samples were prepared following the procedures as previous described 27 . DNA samples were diluted to 50 ng/μ l and genotyped with the catfish 690 K SNP array (GeneSeek, Lincoln, Nebraska, USA). The signal intensity data of each probe on the array were reported in CEL files, which were analyzed with Axiom Analysis Suite (version 1.1.0.616) for quality control and genotype calling. Samples with a Dish value greater than 0.85 and SNP call rates greater than 95% were retained for subsequent analysis.
Following the genotyping step, SNPolisher (Affymetrix) generated quality metrics and classified all the SNPs into six types. Briefly, "PolyHighResolution" indicate that both of the two alleles of a SNP were detected. The signal data of all the samples also formed into three distinct clusters with good resolution; "NoMinorHom" indicate SNPs with two clusters of signal data, with no example of the minor homozygous genotypes; "MonoHighResolution" include SNPs with only one clusters identified. "OTV" refers to off-target variants, indicating SNPs with an OTV cluster that caused by sequence dissimilarity between probes and target genome regions 82 . "CallRateBelowThreshold" were the SNPs with call rates below threshold, but other cluster properties were above threshold. "Other" were the SNPs with one or more cluster properties below the threshold. In Scientific RepoRts | 7:40347 | DOI: 10.1038/srep40347 most cases, SNPs classified as "PolyHighResolution", "NoMinorHom", "MonoHighResolution", were considered as converted SNPs. SNPs classified as "PolyHighResolution" and "NoMinorHom" were considered as polymorphic SNPs.
Linkage map construction. A total of 478 individuals from four full-sib families of the Delta select strain of channel catfish were used for linkage mapping. SNPs classified as "PolyHighResolution" and "NoMinorHom" were retained for further analysis. The genotyping results were exported in the pre-MAKEPED LINKAGE pedigree format 83 .
The genotyping data were imported into Plink (version 1.9) to test pedigree information. A complete linkage agglomerative clustering procedure based on pairwise identity-by-state (IBS) distance were carried out. Multidimensional scaling analysis was performed on the generated IBS pairwise distances matrix. The pedigree information of outliers was identified and checked by subsequent outlier detection diagnostics, where a Z score was assigned to measure the distance between the outliers with the rest of the samples. Outlier samples were discarded once they were detected with significantly larger distances compared with the normal level.
Linkage map was constructed using Lep-MAP2 (version 0.2) 84 . First, the Filtering module was executed to filtering low quality and un-informative markers. Markers with missing values larger than 12 (about 10%) or MAF less than 6 (about 5%) in each family were discarded. Only markers with 2 or more informative families were retained. A segregation distortion test was also performed to compare the offspring genotype distribution and the expected Mendelian proportions. Markers with significant segregation distortion were eliminated from linkage analysis (χ 2 test, P < 0.005). SNP Markers were assigned to linkage groups (LGs) using the SeparateChromosomes module.
LGs were formed according to the threshold of logarithm of the odds (LOD) score limit of 35 and minimum LG size of 10. Singular markers were then added to the established LGs using the JoinSingles module with an LOD score limit of 10 and a minimum difference of 3 between the best LG and the second best LG of each joined marker.
Marker order of each LGs was determined by allowing different recombination probabilities in both sexes. Genotyping data of markers from the four families were analyzed simultaneously. Two-rounds of marker ordering procedures were carried out for better performance. In the first round, ten iterations were performed to acquire the best order of markers. To reduce the computational burden, a missing rate of 5% was set when determining whether two markers were duplicates. As the number of markers may beyond the resolution of recombination, there were many markers stacking up in the same locus on the genetic map, which may lead to recombination rate deformities. Therefore, the stacked markers as well as the duplicated markers were clustered and filtered. The marker with the most informative meiosis of one cluster was selected as the representative marker and retained for a new round of marker ordering. After the second round of ordering, all the previously identified duplicates and stacked markers were added to the adjacent locus and included for genetic distance calculation with Kosambi mapping function by taking account both male and female meiosis. Sex-specific recombination rates were then calculated with the same marker order. MapChart (version 2.3) was used to graphically present the genetic linkage map.