The common carp, Cyprinus carpio, is one of the most important cyprinid species and globally accounts for 10% of freshwater aquaculture production. Here we present a draft genome of domesticated C. carpio (strain Songpu), whose current assembly contains 52,610 protein-coding genes and approximately 92.3% coverage of its paleotetraploidized genome (2n = 100). The latest round of whole-genome duplication has been estimated to have occurred approximately 8.2 million years ago. Genome resequencing of 33 representative individuals from worldwide populations demonstrates a single origin for C. carpio in two subspecies (C. carpio haematopterus and C. carpio carpio). Integrative genomic and transcriptomic analyses were used to identify loci potentially associated with traits including scaling patterns and skin color. In combination with the high-resolution genetic map, the draft genome paves the way for better molecular studies and improved genome-assisted breeding of C. carpio and other closely related species.
Carp (cyprinids) contribute over 20 million metric tons to fish production worldwide and account for approximately 40% of total global aquaculture production and 70% of total freshwater aquaculture production. They have emerged as the most economically important teleost family. In comparison to other major aquaculture species, such as salmon and shrimp, carp are recognized as an ecofriendly fish because most are omnivorous filter-feeders and thus consume much less fish meal and fish oil. As one of the dominant cyprinid species, C. carpio (the common carp) is cultured in over 100 countries worldwide and accounts for up to 10% (over 3 million metric tons) of global annual freshwater aquaculture production1,2. In addition to its value as a food source, C. carpio is also an important ornamental fish species. One of its variants, koi, is the most popular outdoor ornamental fish because of its distinctive color and scale patterns.
Most teleosts have undergone a teleost-specific genome duplication (TSGD) and contain 24 to 25 chromosomes in their haploid genome. The haploid genome of C. carpio has 50 chromosomes3, and molecular evidence suggests that an additional whole-genome duplication (WGD) event tetraploidized the genome4,5,6,7. Although cytogenetic evidence of the allotetraploidization of C. carpio has suggested that 50 bivalents rather than 25 quadrivalents are formed during meiosis6, genome-scale validation is of great importance. Owing to its economic value in aquaculture, C. carpio has been intensively studied in terms of its physiology, development, immunology, disease resistance, selective breeding and transgenic manipulation. In addition, it is also considered an alternative vertebrate fish model to zebrafish (Danio rerio). A variety of C. carpio genome resources have been developed over the past decade, including a large number of genetic markers8,9, genetic maps10,11,12,13, a BAC-based physical map14,15, a large number of ESTs16,17,18 and cDNA microarrays19. Recently, a comparative exomic study of C. carpio and D. rerio has been reported, providing additional genome resourcing data for the research community20.
Using a whole-genome shotgun strategy and combining data from several next-generation sequencing platforms, we have produced a high-quality genome assembly for C. carpio (strain Songpu) and completed the genomic resequencing of 33 C. carpio accessions that represent major domesticated strains and populations. In addition to comparative and evolutionary studies of C. carpio and its closely related species using the genome sequences, we also demonstrate the genetic basis of phenotypic traits on scale patterning and body color determination, on the basis of data from two distinct domesticated strains (Songpu and Hebao). This study on the C. carpio genome provides a valuable resource for the molecular-guided breeding and genetic improvement of the common carp.
Sequencing and assembly
We prepared genomic DNA from a homozygous double-haploid clonal line from the domesticated strain Songpu, which has a documented breeding history. We performed whole-genome shotgun sequencing on three next-generation sequencing platforms, Roche 454, Illumina and SOLiD, using both single-end and paired-end or mate-pair libraries of various insert sizes ranging from 250 bp to 8 kb (Supplementary Table 1). Our contig assembly was based on single-end pyrosequencing data (CABOG, Celera Assembler with Best Overlap Graph), and the scaffold assembly was based on paired-end and mate-pair sequences from different sequencing platforms and 29,046 paired BAC-end sequences14,15. After gap filling, the contig and scaffold N50 lengths reached 68.4 kb and 1.0 Mb, respectively. The total length of all scaffolds was 1.69 Gb (Table 1). We estimated the genome size to be 1.83 Gb on the basis of K-mer analysis, which is consistent with estimates based on cytogenetic methods3,21 (Supplementary Fig. 1). Thus, the scaffolds covered at least 92.3% of the genome (90.2% if we excluded sequence gaps of 40 Mb in length), and 90% of the assembly was contained in 2,503 large scaffolds with a total length of 1.53 Gb.
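The assembly statistics quoted above can be illustrated with a minimal sketch. It assumes the usual definition of N50 (the length at which the sorted sequences first reach half of the total assembled bases); the toy scaffold lengths and function names are hypothetical, and only the 1.69 Gb, 1.83 Gb and 40 Mb figures come from the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together hold at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def coverage_percent(assembled_bp, genome_bp):
    """Share of the estimated genome size represented by the assembly."""
    return 100.0 * assembled_bp / genome_bp


# Hypothetical toy scaffold lengths, not values from the assembly.
toy_scaffolds = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_scaffolds)

# Figures quoted in the text: 1.69 Gb of scaffolds, 1.83 Gb genome, 40 Mb of gaps.
with_gaps = coverage_percent(1.69e9, 1.83e9)
without_gaps = coverage_percent(1.69e9 - 40e6, 1.83e9)
print(toy_n50, round(with_gaps, 1), round(without_gaps, 1))
```

Rounded to one decimal place, the two coverage values agree with the 92.3% and 90.2% reported above.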
To validate the genome assembly, we mapped all paired-end and mate-pair reads from different sequencing platforms to the assembly and found that an average of 80.3% of the reads (78.1% of Illumina reads, 74.6% of Life Technologies SOLiD reads, 98% of Roche 454 reads and 98.8% of BAC-end reads) could be mapped (Supplementary Fig. 2 and Supplementary Table 2). To assess the accuracy of the assembly, we aligned our assembly to an assembled BAC and five large scaffolds from previously published genome sequences20; the result demonstrates high consistency between the two data sets (Supplementary Fig. 3). To assess gene coverage, we mapped assembled transcriptome sequences, including publicly available ESTs and new mRNA reads from multiple tissues, to the assembly. The effort yielded ∼88.8% coverage of these transcripts by nucleotide sequence similarity (Supplementary Table 3); of all of the mapped genes, 90% were common among the sequenced teleosts (Supplementary Table 4). Owing to the multiple rounds of WGD, C. carpio genes are rich in paralogs, which are thought to interfere with assembly. Therefore, we mapped 19 duplicated genes that were shared among teleosts to the genome and found that 16 of the gene pairs mapped to distinct locations whereas the other 3 collapsed into single genes, given the high similarity among the paralogs (Supplementary Table 5).
Genetic map and markers
We anchored the genome assembly onto a newly updated high-resolution genetic map, constructed using a genetic mapping panel of 107 full siblings produced from the cross of a Songpu pair (Supplementary Note). The map placed a total of 3,470 high-quality SNPs and 773 microsatellite markers (Supplementary Table 6), which clustered into 50 linkage groups (Supplementary Fig. 4) and covered a genetic distance of 3,946.7 cM. These linkage groups anchored 1,456 of the longest scaffolds (∼875 Mb in total length and 16.7 Mb of gaps) (Fig. 1 and Table 1), and the median ratio of physical distance to genetic distance was 0.2 Mb/cM (Supplementary Figs. 5 and 6).
The C. carpio genome has a GC content of 37.0%, slightly higher than that of D. rerio but much lower than that of other sequenced teleost genomes (Supplementary Fig. 7 and Supplementary Note). To identify transposable elements (TEs)22, we constructed a C. carpio–specific repeat database. We found that 529 Mb of the assembled contigs (31.3% of the genome assembly) could be attributed to TEs (Table 1 and Supplementary Tables 7 and 8). This proportion of the content is higher than that for most of the sequenced teleost genomes (7.1% in Takifugu rubripes23, 5.7% in Tetraodon nigroviridis24, 30.68% in Oryzias latipes25 and 13.48% in Gasterosteus aculeatus26) but lower than that for the D. rerio genome (59.78%) (Supplementary Table 9). Of the TEs, the fraction of class I TEs (retroelements) was 9.99% of the total genome assembly (4.90% long interspersed nuclear elements (LINEs), 4.35% long terminal repeats (LTRs) and 0.47% short interspersed nuclear elements (SINEs)), whereas that of the class II TEs (DNA transposons) was 17.53%. The most abundant DNA transposon family identified in the C. carpio genome was the hAT superfamily, which had approximately 463,000 copies and accounted for 33% of all identified DNA transposons (consistent with our previous findings from the analysis of BAC-end sequences15). The distribution of the divergence rates for the TEs peaked at 6% in C. carpio and at 8% in D. rerio, suggesting a more recent expansion of these elements in the C. carpio genome (Supplementary Fig. 8).
We used a comprehensive strategy to annotate C. carpio genes by combining ab initio gene prediction (FGENESH and AUGUSTUS), protein-based homology (Supplementary Note) and transcript-based evidence (transcriptomes from multiple tissues and developmental stages) (Supplementary Table 10). All predicted gene structures were integrated with EVidenceModeler (EVM)27 to yield a consensus gene set containing a total of 52,610 protein-coding genes, of which 91.4% were proven to be expressed (Table 1, Supplementary Fig. 9 and Supplementary Tables 11 and 12). This gene number is almost twice that found in D. rerio, confirming the fact that the tetraploid genome retained a large portion of its gene duplicates after the latest WGD. The average gene and coding sequence lengths were 12,145 bp and 1,487 bp, respectively, and C. carpio genes had an average of 7.48 exons per gene (Supplementary Fig. 10 and Supplementary Table 13). In addition, the non-protein-coding genes included 1,012 rRNA, 3,622 tRNA and 914 microRNA (miRNA) genes (Table 1 and Supplementary Table 14).
C. carpio has 100 chromosomes, approximately twice as many as are found in other cyprinid fish species. Many studies have corroborated the occurrence of either TSGD or the third round of WGD in most ray-finned fishes28,29,30,31,32 and have predicted that TSGD has facilitated the evolutionary radiation and phenotypic diversification of the teleost fishes29,31. The C. carpio genome is believed to have undergone an additional round of genome duplication (4R) and to have thus tetraploidized4,5,6,33. We have identified approximately 50 chromosome bivalents rather than quadrivalents in the meiotic nuclei of C. carpio, suggesting that it is not a true tetraploid species according to karyotyping. The tetraploidy observed in C. carpio seems to result from allotetraploidization (species hybridization) rather than autotetraploidization (genome doubling)6. We aligned 52,610 high-confidence gene models to the 50 C. carpio chromosomes and the D. rerio genome (n = 25 chromosomes) and identified 8,002 orthologous gene pairs with a clear two-to-one orthologous relationship between the two species (Fig. 2a). The major region of obscured synteny, found on the long arm of D. rerio chromosome 4, is in accordance with a recent report that highlighted unique features of this region34: the region shows little orthology with other sequenced teleost genomes and harbors zebrafish-specific gene duplication and a high-density small nuclear RNA (snRNA) cluster that accounts for 53.2% of all snRNAs in the genome. This region most likely emerged in D. rerio after the Danio-Cyprinus divergence. In addition, we observed a number of minor chromosome rearrangements on the carp chromosomes, including on the long arm of chromosome 8 (showing weakened orthology with D. rerio chromosome 4) and the region containing orthologs with D. rerio chromosome 17. We also identified 2,114 best-match reciprocal paralogous gene pairs and built ohnologous blocks on 25 paired chromosomes.
A circular representation of ohnolog pairs clearly demonstrates their one-to-one syntenic relationship (Fig. 2b), consistent with previous observations for genome tetraploidization.
To further provide insight into the tetraploid nature of the genome at the gene level after the 4R WGD event, we investigated the hox gene clusters in C. carpio. This species has almost twice the number of hox clusters as D. rerio35 and the same number of hox gene clusters as the Atlantic salmon (Salmo salar)36, which is an autotetraploid species37 (Supplementary Figs. 11 and 12, and Supplementary Note).
To determine the date of the C. carpio WGD event (4R), we used a total of 5,783 gene families and calculated their synonymous substitution rates (Ks values) (Fig. 2c and Supplementary Note). On the basis of a Ks rate of 3.51 × 10−9 substitutions per synonymous site per year5 and the obtained Ks value of 0.03, we estimated that the latest WGD (4R) happened 8.2 million years ago, a date more recent than the predictions suggested in previous reports5,7. The carp-zebrafish orthologous genes displayed a distinct peak (Ks = 0.45) that corresponded to a divergence time of 128 million years ago. Taken together, the duplication time and divergence time estimates suggest that the latest WGD event (4R) occurred long after C. carpio and D. rerio split. Similarly, an analysis of fourfold synonymous third-codon transversion (4dTv) provided additional evidence for an extra round of WGD (Fig. 2d). C. carpio and D. rerio had a peak in common (4dTv = 0.58), which corresponds to the TSGD event (3R). An extra 4dTv peak within the C. carpio paralogous genes (4dTv = 0.1) corresponds to the latest carp-specific WGD (4R).
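The dating arithmetic can be sketched as follows. This is an illustration under an assumption rather than the exact procedure used here: the reported 128-million-year divergence is reproduced when the Ks peak is divided directly by the quoted rate of 3.51 × 10−9 substitutions per synonymous site per year, so that convention is adopted below (a pairwise convention of Ks/(2 × rate) would give half these ages). The function name is hypothetical.

```python
# Rate quoted in the text (substitutions per synonymous site per year).
KS_RATE_PER_YEAR = 3.51e-9


def ks_to_age_million_years(ks_peak, rate=KS_RATE_PER_YEAR):
    """Convert a Ks peak to an age in million years.

    Assumption: age = Ks / rate, the convention that reproduces the
    128-million-year carp-zebrafish divergence reported in the text.
    """
    return ks_peak / rate / 1e6


split_age = ks_to_age_million_years(0.45)  # carp-zebrafish peak
wgd_age = ks_to_age_million_years(0.03)    # carp paralog peak, as rounded in the text
print(f"split ~{split_age:.1f} My, 4R WGD ~{wgd_age:.1f} My")
```

Because the Ks peak of 0.03 is a rounded value, this sketch returns roughly 8.5 million years rather than exactly the reported 8.2; either way, the duplication is far younger than the carp-zebrafish split.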
The annotated gene models of the C. carpio genome are substantially more numerous than those of other completely sequenced fish genomes. To understand the evolutionary relationship of the 52,610 gene models with those of other vertebrates, we performed systematic cross-species comparative analysis and classified the genes according to their similarities. We first used five teleosts (C. carpio, D. rerio, T. rubripes, O. latipes and G. aculeatus), six tetrapods (Homo sapiens, Mus musculus, Sus scrofa, Gallus gallus, Anolis carolinensis and Xenopus tropicalis) and Ciona intestinalis (outgroup) for the comparison (Fig. 2e). We identified 941 single-copy orthologs that were conserved among all investigated species, which only accounted for 1.8% of the predicted gene models of C. carpio. Second, we constructed the species phylogeny using a maximum-likelihood approach with multiple alignments of single-copy orthologs. The remaining gene models (98.2%) were more complex and included many-to-many orthologs (28.0%), non-uniformly occurring, patchy orthologs (11.0%) and undetectable models (6.1%). Third, the predicted C. carpio gene models corresponded to orthologous genes in D. rerio (8,002 orthologous genes), including 2,037 (3.9%) cyprinid-specific gene models. Fourth, we also identified 3,212 species-specific gene models of C. carpio that did not have any homologs in the 10 other vertebrates and the Ciona species examined. This number is higher than the number of species-specific genes in the D. rerio genome, suggesting that a substantial number of novel genes were generated in C. carpio after the divergence of C. carpio and D. rerio, likely owing to the latest WGD and independent gene evolution (Supplementary Table 15 and Supplementary Note).
C. carpio, as a genetically diverse and successful species, has adapted to various environments across a broad ecological spectrum in Eurasia and has been domesticated for more than 2,000 years. This species has been bred into numerous strains and local populations, producing distinct phenotypic changes in its growth rate, temperature and hypoxia tolerance, body color, scale pattern and body shape, which are partially attributable to genome diversity due to its two WGD (3R and 4R) events31. To investigate its genetic variation, we selected 33 representative C. carpio accessions for genome resequencing, which included 13 accessions of 4 wild populations from the Danube River, the Yellow River, the Heilongjiang (Amur) River and the Chattahoochee River and 20 accessions of 6 domesticated strains from Asia and Europe (including Songpu, Xingguo red, Oujiang color, Hebao, Szarvas 22 and koi) (Supplementary Table 16). With a total of 4,176 million paired-end reads (101-bp read length, 417.6 Gb in total length and 229-fold coverage of the genome; Supplementary Table 17), we identified 18,949,596 candidate SNPs and 1,694,102 small insertion-deletions (indels) (Supplementary Table 18).
To investigate the divergence of the representative C. carpio accessions from diverse geographical habitats and domestic histories, we constructed phylogenetic trees on the basis of the sequence variations (Fig. 3a and Supplementary Fig. 13). It was obvious that the European and Asian accessions formed two distinct clades. One of the strains, Songpu, was also grouped into the European clade as it was bred from mirror carp originally introduced from Europe in the 1950s. Our principal-component analysis (PCA) yielded a similar result (Fig. 3b), showing Asian accessions as a tight cluster that was separate from the European accessions. We further analyzed the population structure using the Bayesian clustering program STRUCTURE38. Because the values of ln likelihood were distinctively high for the models K = 3 and 4 (Supplementary Fig. 14), we show the clusters of K = 3 and 4 in Figure 3c. Almost all the accessions either had common ancestry or showed a single origin of the Eurasian population, and the results agree with the hypothesis that modern C. carpio evolved from the Caspian Sea ancestor and spread into Europe and the eastern mainland of Asia39. There were no uniform patterns covering all the populations, with the exception of two extremely isolated wild populations, Heilongjiang and Oujiang; in other words, extensive genetic admixture has been occurring in both the wild and domesticated C. carpio populations. For instance, Songpu carried admixture from the Asian population (K = 3), a finding supported by the recent history of introgression after its introduction to China. We also observed that the US accessions separated into both the European and Asian clades, and the trend indicates multiple introductions to North America from both Europe and Asia. This observation is also supported by our PCA and population structure analyses.
We performed a further genetic diversity scan comparing the Hebao and Songpu genomes to identify highly different genomic regions. Hebao is one of the typical strains derived from East Asian subspecies (C. carpio haematopterus), whereas Songpu is the strain derived from mirror carp of European subspecies (C. carpio carpio) (Fig. 4a). We predict that these two varieties retain substantial genetic differences, given their distinct body shapes, scale morphogenesis and patterns, and skin color phenotypes. We identified a total of 205 genome regions with the highest (top 1% of πHebao/πSongpu) genetic diversity (12.67 Mb in length) containing 326 candidate genes (Supplementary Tables 19 and 20). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis indicated that a significant portion of these candidate genes were associated with epithelial morphogenesis, hair follicle morphogenesis, pigmentation and immune response, including adherens junction signaling, signaling by Rho family GTPases, tight junction signaling, prolactin signaling, fibroblast growth factor (FGF) signaling, interleukin (IL)-6 signaling and other functional pathways (Fig. 4b and Supplementary Table 21). The results are consistent with the phenotype differences in scale pattern and skin color observed for Hebao and Songpu. We investigated the candidate genes and detected 82 genes and 106 genes that harbored nonsynonymous SNPs in the Songpu and Hebao genomes, respectively (Supplementary Tables 22 and 23). We also identified two Songpu genes (fgfr1a1 and lrrc72) and two Hebao genes (zpld1 and nlk) that harbored deletions in coding regions (Supplementary Figs. 15 and 16). All these instances of sequence diversity altering protein-coding sequences provide candidate loci for assessing phenotypic differences between Hebao and Songpu. 
Notably, the fgfr1a1 gene (encoding FGF receptor 1 a1) on chromosome 34 of the Songpu genome contained a 306-bp specific deletion in intron 10 (228-bp deletion) and exon 11 (78-bp deletion) (Fig. 4c). The deletion had previously been reported as the causative mutation for scale loss and reduction in the mirror carp40. Extensive investigation on large samples from four strains confirmed that the deletion was only found in Songpu (Supplementary Fig. 16).
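The diversity scan described above (selecting the top 1% of πHebao/πSongpu) can be sketched in a few lines. The window definition, the handling of windows with zero diversity and all names below are assumptions made for illustration; the text specifies only the ratio and the 1% cutoff.

```python
def top_ratio_windows(pi_a, pi_b, fraction=0.01):
    """Return window IDs ranked by pi_a / pi_b, keeping the top `fraction`.

    pi_a, pi_b: dicts mapping a window ID to nucleotide diversity in each
    strain. Windows missing from either strain, or with zero diversity in
    the denominator, are skipped (an assumption, not stated in the text).
    """
    ratios = {
        window: pi_a[window] / pi_b[window]
        for window in pi_a
        if window in pi_b and pi_b[window] > 0
    }
    n_keep = max(1, round(len(ratios) * fraction))
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical toy data: 200 windows with two clear outliers.
pi_hebao = {w: 0.001 for w in range(200)}
pi_songpu = {w: 0.001 for w in range(200)}
pi_hebao[57] = 0.010
pi_hebao[123] = 0.005
outliers = top_ratio_windows(pi_hebao, pi_songpu)
print(outliers)
```

In practice, adjacent outlier windows would presumably be merged into regions before candidate genes are extracted; that step is not shown.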
Comparative analysis of the skin transcriptome
To further elucidate the differences in the scale pattern and skin color of Hebao and Songpu, we performed a comparative analysis of the skin transcriptome in both strains using a deep RNA sequencing (RNA-seq) approach. We identified 894 differentially expressed transcripts, including 567 upregulated genes in Hebao and 327 upregulated genes in Songpu. The experiment was validated with quantitative RT-PCR (qRT-PCR) on selected genes (Supplementary Fig. 17). Further analysis showed distinct expression patterns in Hebao and Songpu for many genes associated with the Wnt/β-catenin signaling pathway (Supplementary Table 24), which is an essential pathway in initiating hair follicle formation41. Both mammalian hair and teleost scales are skin appendages, and their formation involves similar developmental pathways. We inferred that the gene expression differences in these two different carp populations were correlated with the reduced-scale phenotype in Songpu and the full-scale phenotype in Hebao.
We also observed a difference in the expression of the slc7a11 gene (encoding solute carrier family 7 member 11), the plasma membrane cystine/glutamate exchanger (xCT) that transports cystine into melanocytes. In the melanogenesis pathway, tyrosine is oxidized to form dopaquinone, which is then intracellularly catalyzed to become eumelanin (brown to black pigment) through polymerization and oxidation reactions. However, cystine and dopaquinone can switch off the eumelanin synthesis pathway and promote the synthesis of pheomelanin (yellow to red pigment)42,43. slc7a11 was expressed at a higher level in Hebao than in Songpu, suggesting that more cystine is transported into the pigment cells in Hebao, resulting in the preponderant synthesis of pheomelanin in the skin of Hebao while eumelanin synthesis is suppressed. Higher pheomelanin accumulation in the pigment cells gives Hebao its red skin appearance (Fig. 4d). However, the genetic basis for differences in slc7a11 expression remains unclear, and further investigation will be necessary to understand the overall role of slc7a11 in color variation.
As one of the most representative carp species, C. carpio had a value and global production in 2011 of $5.31 billion and 3.73 million tons, respectively (FAO statistics; see URLs), and the importance of C. carpio has been increasing over the past decade. The species is also widely cultured as an ornamental fish because of its various color and scale patterns. We sequenced and assembled the C. carpio genome from the genome of a gynogenetic individual using multiple next-generation sequencing platforms and a hybrid assembly strategy. The draft genome provides an important genomic resource to study the genetic basis of economically important traits in carp and to facilitate genome-based genetic breeding technologies in common carp aquaculture. The draft genome also provides insight into the latest WGD event of allotetraploidization that occurred approximately 8.2 million years ago, doubling the chromosome number and gene content of C. carpio. The whole-genome resequencing of selected accessions also offers a glimpse into the phylogenetic relationship and population structure of major global accessions of the C. carpio population. Comparison of the genomic diversity of two distinct strains, Songpu and Hebao, coupled with additional transcriptomic studies, has allowed us to identify genetic loci and to determine the molecular basis of scale patterns and skin colors, providing a foundation for further studies using comprehensive approaches to completely define the mechanisms underlying these phenotypes. Thus, the draft genome assembly presented here provides a valuable resource for genetic, genomic and biological studies of C. carpio and for improving the aquaculturally important traits of farmed C. carpio and other key cyprinid species in aquaculture.
This study was approved by the Animal Care and Use Committee of the Centre for Applied Aquatic Genomics at the Chinese Academy of Fishery Sciences.
Genome sequence and assembly.
A gynogenetic Songpu C. carpio was selected as the genomic DNA source for whole-genome sequencing. We constructed 21 shotgun libraries and an 8-kb mate-pair library according to Roche 454 standard operating procedures. The 22 libraries were sequenced on a Roche 454 genome sequencer using GS FLX Titanium chemistry. We also constructed six paired-end libraries by following the Illumina procedure. Paired-end sequencing of each library was performed on an Illumina HiSeq 2000 instrument to produce the raw data. We then filtered out low-quality and short reads to obtain a set of usable reads. Another library with an 8-kb jumping distance was generated and was sequenced on the SOLiD platform. The published 65,720 clean BAC-end sequences were collected for genome assembly. We assembled the Roche 454 read data set and the Sanger BAC-end sequences into contigs using the Celera assembler44. Reads from the Illumina libraries, the SOLiD library and the 8-kb Roche 454 mate-pair library were aligned to the genomic sequences, and paired-end relationships between the reads were used to construct scaffolds. BAC-end sequences were mapped to the scaffolds and were used for further scaffolding. Finally, we used the paired-end information from the short paired-end reads to fill the gaps between the scaffolds with GapCloser45.
Linkage mapping and map integration.
Microsatellite and SNP markers were used for genotyping analysis and linkage map construction. A tailed primer protocol was used to amplify microsatellite alleles46,47. PCR products were analyzed on a 3130xl Genetic Analyzer. Restriction site–associated DNA (RAD) technology48 was used to develop polymorphic SNP markers. JoinMap4.0 software was used to perform the linkage analysis. Linkage between markers was examined by estimating the logarithm of odds (LOD) scores for the recombination rate, and map distances were calculated using the Kosambi mapping function. We then used BLAT (with alignment length coverage of >70%) to align the molecular markers to scaffolds. We linked the scaffolds onto chromosomes with a string of 100 Ns representing the gap between 2 adjacent scaffolds on the basis of a high-resolution genetic map.
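Two steps in this procedure lend themselves to a short illustration: the Kosambi mapping function, which converts a recombination fraction r into a map distance of 25 ln((1 + 2r)/(1 − 2r)) cM, and the joining of ordered scaffolds with a run of 100 Ns. The sketch below is illustrative only; the function names and toy sequences are hypothetical, and scaffold ordering and orientation are taken as given.

```python
import math


def kosambi_cm(recombination_fraction):
    """Kosambi map distance in centimorgans for a recombination fraction r."""
    r = recombination_fraction
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def join_scaffolds(ordered_sequences, gap_length=100):
    """Concatenate already ordered and oriented scaffolds, separating
    neighbours with a fixed run of Ns (100 in the text)."""
    return ("N" * gap_length).join(ordered_sequences)


distance = kosambi_cm(0.10)  # slightly more than 10 cM
pseudo_chromosome = join_scaffolds(["ACGT", "GG"])  # toy sequences
print(round(distance, 2), len(pseudo_chromosome))
```

For small r the Kosambi distance is close to 100 × r cM; it grows faster as r approaches 0.5, which reflects the correction for undetected double crossovers under partial interference.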
Both homology-based and de novo prediction analyses were used to identify the repeat content in the carp genome. For the homology-based analysis, we used Repbase (version 20120418) to perform a TE search with RepeatMasker (3.3.0) and the WuBlast search engine. For the de novo prediction analysis, we used RepeatModeler to construct a TE library. Elements within the library were then classified using a homologous search with Repbase and a Support Vector Machine (SVM) method (TEClass).
Gene prediction and functional annotation.
We used three approaches for gene prediction: ab initio gene prediction, sequence homology–based prediction and expression evidence–based prediction. Briefly, two ab initio prediction software programs, AUGUSTUS49 and FGENESH50, were used to predict genes in the repeat-masked genome sequences. Gene model parameters for the programs were trained from long genes and known teleost genes processed by PASA51. Sequence homology–based gene prediction included both raw and precise alignments. First, protein sequences from the NCBI non-redundant (nr) database and 68 species sequences in Ensembl (version 68)52 were collected to build a database. Assembled genome sequences were aligned to their corresponding protein sequences in the database using BLASTX. Identified homologous proteins were selected and then aligned to the genome with TBLASTN. Adjacent and overlapping matches were merged using Perl scripts, building the longest protein for each genomic sequence region. Each target region in the genome was then extended by 10 kb from both ends of the aligned region to cover potential UTRs. Protein sequences were then aligned to those genome fragments by Genewise53. Transcriptome reads were generated using the Roche Genome Sequencer FLX (previously released data; available from the Sequence Read Archive (SRA) under accessions SRA009366 and SRA050545) and the Illumina HiSeq 2000 (Supplementary Note). Reads were mapped to genomic sequences by TopHat54, and Cufflinks55 was used to produce transcript assemblies. For a gene locus with several alternatively spliced transcripts generated by Cufflinks, the transcript with the longest exon length was chosen. All evidence was merged to form a comprehensive consensus gene set using EVM27. PASA was used to update the EVM consensus predictions by adding UTR annotations. 
To obtain gene function annotations, BLAST searches were conducted against the NCBI nr, SwissProt and TrEMBL protein databases, and homologs were called with E values of <1 × 10−5. The functional classification of GO categories was performed using the InterProScan56 program. Pathway analysis was performed using the KEGG57 annotation service, the KEGG Automatic Annotation Server (KAAS)58.
Comparative genomic analysis.
Protein-coding genes and coding DNA sequences from 12 species (D. rerio, G. aculeatus, O. latipes, T. rubripes, T. nigroviridis, X. tropicalis, A. carolinensis, H. sapiens, M. musculus, S. scrofa, G. gallus and C. intestinalis) were downloaded from Ensembl (version 68)52. For genes with alternatively spliced variants, only the longest transcript was selected. Any genes encoding proteins of fewer than 30 amino acids were discarded. The OrthoMCL pipeline59 was used to define gene families in the common ancestor of the species. All-against-all similarity searches were performed using BLASTP, with an E-value cutoff of 1 × 10−5. The well-aligned regions of each gene family, aligned using MUSCLE60, were extracted with Gblocks61. Phylogenetic analysis of the superalignments was performed using a maximum-likelihood method implemented in PhyML62 with the Jones-Taylor-Thornton (JTT) model. C. intestinalis was selected as the outgroup. We used MCScanX63 to identify syntenic blocks for C. carpio and D. rerio, with the gap size set to 15 genes and at least 5 syntenic genes. Circos64 was used for visualization.
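The pre-filtering step described above (one representative per gene, then a minimum protein length) might look like the following sketch. It assumes that protein length is used as the proxy for the longest transcript; the record layout and names are hypothetical.

```python
def select_representative_proteins(records, min_length=30):
    """Keep the longest protein per gene, then drop genes whose
    representative is shorter than `min_length` amino acids.

    records: iterable of (gene_id, transcript_id, protein_sequence).
    Returns a dict mapping gene_id to (transcript_id, protein_sequence).
    """
    longest = {}
    for gene_id, transcript_id, protein in records:
        current = longest.get(gene_id)
        if current is None or len(protein) > len(current[1]):
            longest[gene_id] = (transcript_id, protein)
    return {
        gene_id: entry
        for gene_id, entry in longest.items()
        if len(entry[1]) >= min_length
    }


# Hypothetical toy records.
toy_records = [
    ("geneA", "t1", "M" * 50),
    ("geneA", "t2", "M" * 80),
    ("geneB", "t3", "M" * 20),  # shorter than 30 aa, so discarded
]
representatives = select_representative_proteins(toy_records)
print(sorted(representatives))
```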
To detect the conserved synteny blocks generated by the fourth round of genome duplication, we identified the reciprocal best-match paralogs from the above all-against-all BLASTP comparisons. Two chromosome regions with the gap size set to 15 genes and at least 5 syntenic genes were considered to have been duplicated.
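Reciprocal best-match paralogs can be extracted from all-against-all hits roughly as follows. This is a minimal sketch that assumes hits are ranked by bit score and that self-hits are ignored; the actual ranking criterion is not stated in the text, and the tuple layout is hypothetical.

```python
def reciprocal_best_pairs(hits):
    """Return the set of reciprocal best-hit pairs.

    hits: iterable of (query, subject, score) from an all-against-all
    search within one proteome. Self-hits are ignored; on ties, the
    first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    pairs = set()
    for query, (subject, _) in best.items():
        partner = best.get(subject)
        if partner is not None and partner[0] == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


# Hypothetical toy hits: a1 and a2 are each other's best non-self match.
toy_hits = [
    ("a1", "a1", 500), ("a1", "a2", 300), ("a1", "b1", 100),
    ("a2", "a1", 290), ("a2", "b1", 120),
    ("b1", "a2", 120), ("b1", "a1", 100),
]
paralog_pairs = reciprocal_best_pairs(toy_hits)
print(paralog_pairs)
```

The resulting pairs would then be passed to a collinearity step with the gap and block-size thresholds given above to call duplicated chromosome regions.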
Genome evolutionary analysis.
We used two methods to detect genome duplication signatures. All-against-all BLASTP comparisons (E value < 1 × 10−5) were used to identify pairs of homologous genes. For each homologous gene pair in C. carpio, the synonymous site divergence value (Ks) was calculated using the CodeML program (run mode −2) from the PAML package65. The distributions of Ks values for D. rerio paralogous pairs and pairs between D. rerio and the common carp were analyzed using the same pipeline. We calculated the 4dTv values of paralogous pairs within species and of orthologous pairs between species to give the distribution of the 4dTv value to estimate the speciation and WGD event that occurred during evolutionary history.
Whole-genome resequencing and phylogenetic analysis.
Thirty-three individuals from 10 strains of C. carpio were randomly collected across Europe, North America and China. Danube River carp and Szarvas 22 were collected from the carp live gene bank of the Research Institute for Fisheries, Aquaculture and Irrigation of Hungary (HAKI). North American carp were collected from the Chattahoochee River in the United States. All other strains or wild populations were collected from China, including Songpu carp from the Heilongjiang Fishery Research Institute; Yellow River carp from the Henan Academy of Fishery Sciences; Heilongjiang River carp from Fuyuan county, Heilongjiang province; Hebao carp from Wuyuan county, Jiangxi province; Xingguo red carp from Xingguo county, Jiangxi province; Oujiang color carp from Longquan county, Zhejiang province; and koi from the breeding population of the Beijing Fishery Research Institute. Fin clips and blood samples were collected, and DNA was extracted using the DNeasy Blood and Tissue kit (Qiagen). Genome resequencing was conducted using the Illumina HiSeq 2000 platform. Paired-end reads from each accession were aligned to the reference genome using the Burrows-Wheeler Aligner (BWA)66. After mapping, SNPs were identified on the basis of the mpileup files generated with SAMtools67. The filtering threshold was set to require a read depth of ≥10 and a quality score of ≥20. Genotypes supported by at least two reads and with a minor allele frequency of ≥0.1 were assigned to each genomic position. We performed all-against-all BLASTP for genes in 5 teleosts (C. carpio, D. rerio, T. rubripes, O. latipes and G. aculeatus) to determine the similarity for each gene pair and to identify single-copy genes, obtaining 8,375 homozygous SNPs from 7,709 single-copy genes. A maximum-likelihood tree was constructed with PhyML68 and displayed with MEGA69.
PCA was performed with EIGENSOFT70, and homozygous SNPs were used to investigate the population structure using STRUCTURE39 with 2,000 iterations and 2–8 clusters (K). The result of the structure matrix was plotted using DISTRUCT71 software.
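The SNP filtering thresholds given above (read depth ≥10, quality score ≥20, genotypes supported by at least two reads and a minor allele frequency of ≥0.1) can be expressed as a short sketch. The data layout, function names and the restriction to biallelic sites are assumptions made for illustration; this is not the SAMtools-based procedure used in the study.

```python
def minor_allele_frequency(genotypes, min_reads=2):
    """Minor allele frequency at one biallelic site.

    genotypes: list of (allele1, allele2, supporting_reads) tuples,
               one per individual (hypothetical layout).
    Calls supported by fewer than min_reads reads are ignored.
    """
    counts = {}
    for allele1, allele2, reads in genotypes:
        if reads < min_reads:
            continue
        for allele in (allele1, allele2):
            counts[allele] = counts.get(allele, 0) + 1
    # Sites that are not biallelic after filtering are treated as uninformative.
    if len(counts) != 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())


def keep_snp(depth, quality, genotypes, min_depth=10, min_quality=20, min_maf=0.1):
    """Apply the depth, quality and minor allele frequency thresholds."""
    return (
        depth >= min_depth
        and quality >= min_quality
        and minor_allele_frequency(genotypes) >= min_maf
    )


site_genotypes = [("A", "A", 5), ("A", "G", 3), ("G", "G", 1), ("A", "A", 4)]
site_maf = minor_allele_frequency(site_genotypes)
```

Here the third genotype is dropped because it is supported by a single read, leaving one G allele among six called alleles.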
Genome diversity analysis and comparison.
We calculated the π distribution for each linkage group using a sliding window method with Tajima's D test in Variscan72 software. The window width was set to 50 kb, and the step size was 10 kb. π values were compared, and the ratios were sorted. Using the ratio values, we identified the regions in the top 1% and bottom 1% of the diversity distribution, and the annotated genes in these regions were analyzed. Putative SNPs and deletions in the coding regions were identified by mapping the RNA-seq reads to annotated reference genes using BWA and SAMtools. PCR was performed on the deletion regions to verify the identified gene deletions. PCR products were analyzed via electrophoresis on a 2% agarose gel.
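The sliding-window scheme (50-kb windows advanced in 10-kb steps) can be sketched as follows. This is a simplified stand-in for the Variscan analysis, not a reimplementation of it: it assumes biallelic SNPs supplied as (position, alternate allele count, number of sampled chromosomes) tuples with 0-based positions, and it divides the summed per-site diversity by the window length. All names are hypothetical.

```python
def site_pi(alt_count, n_chrom):
    """Expected pairwise difference at one biallelic site."""
    if n_chrom < 2:
        return 0.0
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def windowed_pi(sites, seq_length, window=50_000, step=10_000):
    """Per-base nucleotide diversity in sliding windows.

    sites: list of (position, alt_count, n_chrom) tuples, 0-based positions.
    Returns a list of (start, end, pi) with end exclusive.
    """
    results = []
    last_start = max(seq_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + window, seq_length)
        total = sum(site_pi(alt, n) for pos, alt, n in sites if start <= pos < end)
        results.append((start, end, total / (end - start)))
    return results


example_sites = [(100, 1, 4), (60_000, 2, 4)]
pi_windows = windowed_pi(example_sites, seq_length=70_000)
```

For a 70-kb sequence this yields three windows starting at 0, 10 kb and 20 kb; the ratio of π between two populations can then be taken window by window and sorted to find the extreme regions.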
RNA sequencing analysis.
RNA was extracted from the skin tissues of 18 Hebao and 18 Songpu individuals and pooled for each strain. RNA-seq reads were generated using the Illumina HiSeq 2000 platform (Supplementary Note). Reads with a low quality score and a read length of less than 10 bp were removed. All cleaned reads were mapped to the assembled reference with Bowtie73. Then, RSEM (RNA-Seq by Expectation Maximization)74 was used to estimate and quantify gene and isoform abundance. Gene expression was measured in fragments per kilobase of exon per million fragments mapped (FPKM)55. Finally, edgeR75 was used to normalize the expression levels in both strains to identify the differentially expressed transcripts by pairwise comparisons. For qRT-PCR validation, total RNA was isolated and purified from all of the samples using the RNeasy kit (Qiagen) and was quantified using a NanoDrop and a Bioanalyzer 2100 (Agilent Technologies). qRT-PCR was performed on the ABI PRISM 7500 Real-Time PCR System with three replicates using the QuantiTect SYBR Green PCR kit (Qiagen). The actb gene was used as the internal reference. Primer information is provided in Supplementary Table 25. Two-sided t tests were used to compare expression levels. GO annotation of the genes was performed on the basis of orthologous relationships with the gene set of D. rerio. Pathway analysis was performed using Ingenuity Pathway Analysis (IPA) tools (Ingenuity Systems).
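For reference, the FPKM unit used above can be written as a one-line calculation. The sketch below uses the basic definition, with the total exon length of the gene and the total number of mapped fragments in the library; tools such as RSEM estimate abundances with an effective transcript length, so this is a simplified illustration rather than the computation performed in the study. The function name and example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# 500 fragments on a gene with 2 kb of exon, in a library of 25 million mapped fragments.
example_fpkm = fpkm(500, 2_000, 25_000_000)
```

Dividing by both gene length and library size makes values comparable between genes of different lengths and between libraries of different sequencing depths.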
Celera Assembler, http://wgs-assembler.sourceforge.net/; BWA, http://bio-bwa.sourceforge.net/; FGENESH, http://www.softberry.com/; Ensembl, http://www.ensembl.org/; Gene Ontology (GO), http://www.geneontology.org/; KEGG, http://www.genome.jp/kegg/; Repbase, http://www.girinst.org/repbase/index.html; RepeatMasker, http://repeatmasker.org/; Food and Agriculture Organization of the United Nations (FAO) statistics, http://www.fao.org/fishery/culturedspecies/Cyprinus_carpio/.
The common carp whole-genome shotgun sequencing, genome resequencing and RNA sequencing results have all been deposited in GenBank under project accession PRJNA202478. The genome assembly has been deposited in the European Nucleotide Archive (ENA) under project PRJEB7241.
We acknowledge Z. Zhu, N. Li, J. Gui, Z. Bao, G. Zhang, X.L. Zhang, J. Li, Y. Liu, Q. Liu and S. Chen for their support on the common carp genome project. We thank H. Hu, G.Z. Liu, J.H. Yu and C.J. Li for their assistance in sample collection and Y. Wan, T. Sun, W. Liu, L. Jiang, Shu Wang, Y. Zhu, X. Xing, P. Zhou and R. Cui for genotyping and sequencing. We thank J. Postlethwait, G. Hulata, L. David, L. Orban and F. Zhao for their helpful discussions. We acknowledge grant support from the National High-Technology Research and Development Program of China (863 program; 2011AA100401, 2011AA100402 and 2009AA10Z105), the National Department Public Benefit Research Foundation of China (200903045), the National Basic Research Program of China (973 Program; 2010CB126305), the National Natural Science Foundation of China (31302174 and 31101893) and Special Scientific Research Funds for Central Non-Profit Institutes of the Chinese Academy of Fishery Sciences (2009B002 and 2011C016).
Supplementary Tables 6, 19–21 and 23–25