Global tissue transcriptomic analysis to improve genome annotation and unravel skin pigmentation in goldfish

Goldfish is an ornamental fish with diverse phenotypes. However, the limited genomic resources of goldfish hamper our understanding of the genetic basis for its phenotypic diversity. To provide enriched genomic resources and infer possible mechanisms underlying skin pigmentation, we performed a large-scale transcriptomic sequencing on 13 adult goldfish tissues, larvae at one- and three-days post hatch, and skin tissues with four different color pigmentation. A total of 25.52 Gb and 149.80 Gb clean data were obtained using the PacBio and Illumina platforms, respectively. Onto the goldfish reference genome, we mapped 137,674 non-redundant transcripts, of which 5.54% was known isoforms and 78.53% was novel isoforms of the reference genes, and the remaining 21,926 isoforms are novel isoforms of additional new genes. Both skin-specific and color-specific transcriptomic analyses showed that several significantly enriched genes were known to be involved in melanogenesis, tyrosine metabolism, PPAR signaling pathway, folate biosynthesis metabolism and so on. Thirteen differentially expressed genes across different color skins were associated with melanogenesis and pteridine synthesis including mitf, ednrb, mc1r, tyr, mlph and gch1, and xanthophore differentiation such as pax7, slc2a11 and slc2a15. These transcriptomic data revealed pathways involved in goldfish pigmentation and improved the gene annotation of the reference genome.

www.nature.com/scientificreports/ predicted with limited RNA-Seq data. Additional data on alternative isoforms and untranslated regions are critical for understanding the phenotypic diversity in goldfish. In this study, we combined the PacBio and Illumina sequencing to obtain a comprehensive transcriptomic landscape of goldfish stain, Oranda. PacBio Singlemolecule real-time (SMRT) long read sequencing displays superiority in higher complete readouts of full-length transcripts and greater accuracy in splice isoforms over the Illumina short-read sequencing 11 . It has been widely applied to obtain full-length transcripts in species with reference genomes such as human 12 , zebrafish 13 , rice 14 , and sorghum 15 , and species without reference genomes such as Gymnocypris selincuoensis 16 and Misgurnus anguillicaudatus 17 . We combined the PacBio long-read sequencing and Illumina short-read sequencing technologies to obtain a more complete transcriptome in goldfish. The goldfish genome annotations were further improved with these transcriptome data and novel genes and different isoforms were identified compared with the reference annotation. In addition, we also identified genes involved in pigmentation pathways using goldfish with different color skin. The transcriptomic data and genomic resources provided in this study improved the goldfish genome annotation and enhanced the understanding of pigmentation pathways and would facilitate future researches on genetic basis of diverse phenotypes.

Results
PacBio full-length sequencing and processing. To ensure higher coverage of genes and their isoforms, we pooled RNA samples from 13 tissues of one cyan-skin adult goldfish and larvae at one-and three-days post hatch (dph) for PacBio sequencing (Fig. 1). We obtained 25.52 Gb post-filter polymerase reads that contained 11,448,193 subreads with an average length of 2,184 bp and N50 length of 3,108 bp, respectively ( Table 1). The subreads were utilized to generate circular consensus sequences (CCS) and a total of 632,099 CCS reads were obtained. Among the CCS reads, 573,366 (90.71%) reads were identified as full-length non-chimeric (Flnc) transcripts (containing 5′ primer, 3′ primer and the polyA tail), 47,515 (7.52%) reads were identified as non-full length non-chimeric (Nflnc) reads, and 11,218 (1.77%) were identified as chimeric reads (CR). The mean length and N50 length of Flnc reads were 2920 bp and 3924 bp, respectively (Table 1). Subsequently, the Flnc reads were clustered by Iterative Clustering for Error Correction (ICE) algorithm to generate the consensus sequences and the Nflnc sequences were used to polish the ICE consensus sequences. A total of 300,733 full-length polished consensus sequences were obtained with an average length of 2949 bp and N50 length of 4000 bp. The polished consensus sequences were further rectified with Illumina RNA-Seq data, resulting in a slightly improved average length of 2956 bp and N50 length of 4017 bp (Table 1).
PacBio data genome mapping. The corrected consensus sequences were mapped against the goldfish reference genome to further improve gene structure annotation using GMAP. More than 95.48% sequences (287,125) were successfully mapped onto the reference genome while the remaining 13,608 (4.52%) sequences were unmapped. Among the mapped sequences, 250,250 (83.21%) sequences were uniquely mapped, and of which, 148,676 (49.44%) sequences were mapped onto the positive strand while 101,574 (33.78%) sequences were mapped onto the negative strand of the reference genome. The remaining 36,875 (12.26%) sequences showed multiple alignments on the reference genome (Fig. 2a).

Isoform detection and characterization.
A total of 137,674 non-redundant transcript isoforms were obtained and classified into three types compared to reference genome annotation (Fig. 2b). The first type, or the known isoforms of reference genes, contained 7,626 isoforms accounted for 5.54% of the identified isoforms. The second type, novel isoforms of reference genes, consisted of 108,122 isoforms accounted for most of the identified isoforms (78.53%) (Supplementary Table S1). The third type, isoforms of new genes, which were aligned to the un-annotated regions of the reference genome, comprised 21,926 isoforms accounted for 15.93% of the identified isoforms.
The non-redundant isoforms were aligned to 44,195 gene loci, of which 26,573 (60.13%) gene loci were overlapped with reference protein-coding genes, and the remaining 17,622 (39.87%) gene loci were aligned to unannotated loci and identified as new genes compared to the reference genome annotation (Fig. 2c, Supplementary  Table S2). These new gene loci contained both protein-coding genes (10,713 loci) and nonprotein-coding genes (5,808 loci), namely, lncRNA genes. In addition, 1,101 loci contained both protein-coding genes and lncRNA genes. In total, 11,814 gene loci contained protein-coding genes (Fig. 2d).The distribution of isoform length was compared between PacBio Iso-Seq transcripts and the reference genes. The density distribution curve showed that PacBio Iso-Seq isoforms were longer than the reference genes (Fig. 2e). The average length and N50 length of PacBio Iso-Seq isoforms were 2985 bp and 3939 bp, respectively, whereas those of reference genes were 1378 bp and 2227 bp, respectively (Fig. 2f). Moreover, 21,291 out of 44,195 (48.18%) PacBio Iso-Seq genes possessed two or more transcript isoforms with an average of 3.12 isoforms per gene while each reference gene was only annotated with a single isoform (Fig. 2g Figure 1. Overview of the experimental design and the data processing pipeline. This study consisted of experimental design, library construction and sequencing, and data processing and analyses. Goldfish samples include 13 tissues from one cyan fish, four color skins and larvae at 1 and 3 dph. All 22 samples were subject to Illumina sequencing, and only 13 tissues from one cyan fish and larvae at 1 and 3 dph were used for PacBio sequencing. The figures were constructed in Microsoft Office PowerPoint.  18 . This gene regulates programmed alternative splicing from development to diseases 19 . In summary, the genes encoded in goldfish genome had complex AS events. Classification of non-coding RNAs. A total of 11,742 lncRNA transcripts encoded in 10,240 gene loci (Supplementary Table S6) were identified (Fig. 4b). The lncRNAs were classified into four types according to their positions on the reference genome. Large intergenic noncoding RNA (lincRNA) accounted for the majority   www.nature.com/scientificreports/ of lncRNA (54.79%), whereas antisense lncRNA accounted for 23.62% of the lncRNA. In addition, sense intronic and sense overlapping lncRNAs accounted for 11.07% and 10.53% of the total lncRNAs, respectively (Fig. 4c).
The length distribution curves of lncRNA and mRNA transcripts showed that lncRNA was shorter than mRNA ( Fig. 4d). Moreover, 83.7% (9,826) lncRNA genes possessed only one exon while mRNA genes exhibited an average of 9.9 exons (Fig. 4e). These lncRNA transcripts were compared with 2,806 lncRNA genes annotated in goldfish genome on Ensembl database (version 101.1). The results showed that 433 lncRNAs were overlapped between our prediction and the Ensembl annotation.
Global tissue transcriptomic analysis. The Illumina clean reads were mapped onto the goldfish reference genome, and the expression level of each isoform was estimated using the updated GFF file containing 87,946 genes from this study and the reference genome. Each Illumina RNA-Seq libraries generated more than 5.75 Gb clean data. The mapping rate of clean reads varied from 67.09% (Sn1) to 86.56% (dph1) (Supplementary Table S7). Around five thousand genes were expressed (FPKM > = 0.1) in each adult tissue, with the brain and testis tissue expressed the most genes (59,789 and 59,286) (Fig. 5a). Gene expression cluster analysis was performed using pheatmap R package 1.0.12 with hierarchical clustering method of Euclidean distance. The heatmap results showed that tissues clustered into several groups. The skin, fin, hood, and intestine were clustered together while muscle, heart and eyes formed the second group. The spleen, kidney and gill clustered into the third group. The testis and brain were clustered together probably due to the number of closely expressed genes. In general, these clustering patterns demonstrated that functionally related tissues shared similar gene expression profiles (Fig. 5b).
Skin-and color-specific transcriptomic analysis. To identify skin-specific genes, differentially expressed genes were analyzed between skin and other tissues. The genes up-regulated in skin (fold change > 2 and FDR < 0.05) compared with at least other 9 tissues were defined as the skin-specific expressed genes (Fig. 6a). KEGG pathway analysis revealed that the skin-specific genes were significantly enriched in several signaling pathways, including ECM-receptor interaction, melanogenesis, cell adhesion molecules, focal adhesion, tyrosine metabolism, and so on (Fig. 6b).
To further identify the genes and pathways related to color-specific skin, we investigated gene expression profiles of four different color skins (black, cyan, red, and white) with RNA-Seq. A total of 162 DEGs were identified among four color skins (Fig. 6c, Supplementary Table S8). The DEGs of color-specific skin were significantly involved in melanogenesis, tyrosine metabolism, riboflavin metabolism, and folate biosynthesis (Fig. 6d). We found that 13 out of the 162 DEGs were involved in pigmentation pathways 8,20 (Supplementary  Table S9). Generally, these 13 DEGs showed low expression levels in white skin. Mitf and Ednrb were involved in melanophore development. Mc1r and Mc5r were associated with the regulation of melanogenesis. Tyr and Tyrp1b were components of melanosomes. Mlph were important for melanosome transport. All of them played a role in the biological processes of melanogenesis 5,8 . On the other hand, Gch1 was involved in pteridine synthetic pathway. Pax7, Slc2a11, and Slc2a15 were associated with xanthophore development and differentiation 8,21 .
To gain further insight of the pigmentation regulation in different color skins, we used the expression levels of the pigmentation genes from four-color skins and constructed a gene regulatory network of pigmentation in goldfish (Fig. 7). As expected, the network of melanogenesis, tyrosine metabolism, and pteridine synthesis interacted with each other, and mediated by at the secondary message cAMP . www.nature.com/scientificreports/

Discussion
High-quality Iso-Seq significantly improved goldfish genome annotation. Long-read sequencing technology can capture full-length transcripts without assembly and overcome the difficulties presented by short-reads 22 . However, PacBio Iso-Seq is still rarely used in aquaculture animals. In the present study, combining the PacBio Iso-Seq and Illumina RNA-Seq, we firstly reported the high coverage of full-length transcriptome that significantly improved the published goldfish genome annotation. The Iso-Seq data from 13 pooled samples generated 300,733 high-quality sequences. The mean length and N50 length of the full-length transcripts were 2956 bp and 4,017 bp, which were longer than those of the de novo assembled transcripts generated with RNA-Seq in goldfish (879.5 and 1318 bp), and in other fish species (815 and 1479 bp) [23][24][25] . Moreover, our results are comparable to those generated with PacBio Iso-Seq in Gymnocypris selincuoensis (3, 509 and 3,870 bp) 16 . In addition, the PacBio Iso-Seq technology exhibits the advantage of discovering more novel isoforms and AS events in many species, including reference genome-free species 26 . In this study, we identified 137,674 isoforms from 44,195 gene loci, including 108,122 (78.53% of total isoforms) novel isoforms of previously known genes and 17,622 novel genes. Even in well annotated genome of model animal, more novel isoforms could still be identified using PacBio Iso-Seq. A recent study on zebrafish using PacBio Iso-Seq discovered 1835 novel isoforms of known genes that have not been annotated in the RefSeq catalog 13 . Compared to goldfish, Gymnocypris selincuoensis, the cyprinid fish living in lake Selincuo on the Qinghai-Tibet Plateau, may present fewer AS events (2069 pairs) in full-length transcripts 16  We investigated the gene expression pattern of the newly discovered 11,814 protein-coding genes (Fig. 2d). The heatmap showed that most of the genes were specifically and dominantly expressed in one or two functionally  Table S10, and Fig. S2). We also investigated the gene expression pattern of the 11,742 lncRNA transcripts. Similar to the novel protein-coding genes, the lncRNA genes were also specifically and dominantly expressed in one or two functionally similar tissues (Supplementary Table S11, and Fig. S3). This is the possible reason why these newly identified protein-coding genes were missing in the reference genome annotation with limited RNA-Seq data.
To evaluate the gene completeness of the transcriptome, the 137,674 non-redundant transcript isoforms were analyzed and assessed by Benchmarking Universal Single-Copy Orthologs (BUSCOs) 28 using the vertebrate core gene sets (vertebrata_odb10), resulting in 2,673 complete (79.7%), 145 fragmented (4.3%), and 536 (16.0%) missing BUSCOs of total 3,354 BUSCOs ( Supplementary Fig. S4). Higher percentage of the missing BUSCOs than the reference genome annotation (5% missing) is probably due to low coverage of the goldfish transcriptome by PacBio Iso-Seq sequencing (the polymerase reads number is about tenfold of total genes, Supplementary  Table S12). This is also why many reference genes were not covered by PacBio Iso-Seq genes (Fig. 2C). Combined with the reference genome annotation, our work greatly improved goldfish genome annotation, identified more protein-coding genes and more isoforms for each gene.

Identification of color-specific genes.
Skin color in teleost fish is essential for social events, including avoiding predators, catching prey and communicating with conspecifics 4 . Coloration is represented by the interaction of chromatophores, including melanocytes, xanthophores, erythrophores, and cyanophores 5 . Emerging www.nature.com/scientificreports/ studies have revealed the genetic mechanism of skin pigmentation in fish. For instance, both blue and brown colors are developed from the implicit homozygous, the former is controlled by one pair and the latter is controlled by four pairs of Mendelian factors 29 . Whole genomic sequencing of goldfish with various body color phenotypes demonstrated that the presence of two evolutionally-asymmetric subgenomes in the goldfish genome may contribute to production of a higher number of varieties in body colors compared to other teleost fish 30 .
In this study, we identified color-specific genes and revealed that DEGs of color-specific skin were significantly involved in melanogenesis, tyrosine metabolism, riboflavin metabolism, and folate biosynthesis. In accordant with goldfish, in Triplophysa siluroides, the candidate genes related to skin color variation mainly participate in melanosome, melanin metabolic process, and melanin-concentrating hormone activity 31 . Moreover, the differently expressed genes in different colored skin of koi carp have been shown to be enriched in melanogenesis and tyrosine metabolism 32 . Tyrosine metabolism genes are also involved in the gray-to-red body color formation of red crucian carp 33 . Due to diverse color pattern, goldfish could be a perfect model to study the genetic mechanism of pigmentation. The red-colored goldfish was used for studying the neuroendocrine effects on body color, because it contains iridophores and xanthophores but not melanophores in the scales or skin 34 . Transcriptome sequencing of red crucian carp and white crucian carp identified that mitfa, tyr, and tyrp1 were significantly down-regulated and gch1 was up-regulated in red crucian carp 35 . In this study, we compared transcriptomes of four goldfish skins with different colors and obtained 162 DEGs. Among the DEGs, mitf, tyr, tyrp1, and ednrb played important roles during the melanocyte development 20 . Mitf is a microphthalmia-associated transcription factor, which also regulates the development of melanocyte in mammals, and determines the formation of melanocytes and iridophores in zebrafish 36 . After FSGD, mitf have been duplicated in teleost fishes. For instant, zebrafish genome encoded two paralogous genes of mitf, namely mifta and miftb. Mitfa is involved in development of chromatophore, while mitfb is expressed in retinal pigment epithelium 37 . The melanocortin receptors (MCRs), belonging to G-protein coupled receptors (GPCRs), exert multiple functions in the control of pigmentation in the melanocortin system 38 . Five types of melanocortin receptors (MC1R, MC2R, MC3R, MC4R, and MC5R) have been identified in goldfish 39 , and mc1r and mc5r were expressed in the xanthophores in scales 34 . In this study, mc1r and mc5r had different expression level among the skins and had no expression in the white skin. GTP cyclohydrolase I (Gch1), the ratelimiting enzyme in pteridine synthesis, catalyzes the de novo synthesis of tetrahydrobiopterin (H4biopterin) from GTP 5 . Gch1 is also an essential cofactor for all nitric oxide synthases, and protects skin cells by restoring cellular H4biopterin and NO against radiation-induced damage 40 . GchI expression is an initial step for melanophore and xanthophore differentiation due to its involvement in different component pathways 5,41 . The DEG gch1 had high expression in three skins, which suggested that gch1 maybe play an important role in goldfish pigmentation. Three other DEGs, pax7a, slc2a11, and slc2a15, may have functions in xanthophore differentiation in goldfish as previous study showed that Pax7a, Slc2a11 and Slc2a15 are important for the development and differentiation of xanthophore and leucophore in medaka 21 .
In addition to protein-coding genes, lncRNAs may also be involved in the skin color variation 32,42 . We investigated lncRNA expression across four color skins (Supplementary Table S13 and Fig. S5) and identified the 140 differentially expressed lncRNAs between the different color skins (Supplementary Table S14 and Fig. S6, 7). Further investigation is required to confirm their functions in skin color variation.
Comparison of ohnolog genes located in two subgenomes. The common ancestor of goldfish and common carp underwent a fourth round of WGD (Cs4R, carp-specific WGD) approximately 8-14.4 million years ago 9,10 . Cs4R was suggested to be an allotetraploidization event, which is the doubling of a complete set of chromosomes following interspecific hybridization of diploid progenitors (2n = 50) 43 . According to the distribution of transposable elements on the goldfish chromosomes (2n = 100), Kon et al. partitioned the goldfish chromosomes into two subgenomes: L (relatively long) and S (relatively short) 30 . We compared the number of isoforms, exon phases, isoform expression variance and entropy between the 5,404 pairs of ohnolog genes on L and S subgenomes (The ohnolog pairs kindly provided by Prof. Yoshihiro Omori). The genes located in the L subgenome (L-genes) had slightly more isoform number (pairwise t test p = 0.001081, Supplementary 2 than the genes located in the S subgenome (S-genes), but the tissue entropy (or tissue specificity) of the L-genes was slightly lower (higher) than that of the S-genes (p = 0.066 (0.030)) ( Supplementary Table 15.3, 15.4). The isoform entropy and standard deviation of expression of the L-genes were larger than the S-genes in most tissues (Supplementary Table 15.5, 15.6, Supplementary Fig. S8), suggesting the L-genes expressed more diversely across different isoforms. There was no significant difference of isoform specificity between L and S sub-genome, although the isoform specificity was lower on the L-genome (Supplementary Fig. S9). In summary, genes located in the L and S subgenomes evolved asymmetrically 30 , and that the L subgenome is the dominant subgenome, which is more often preserved and similar to the ancestral state 44,45 .

Conclusion
In summary, our work demonstrated that full-length transcriptome sequencing has advantages in identifying novel isoforms, new gene loci and AS events. Our work significantly improved goldfish genome annotation. In addition, we presented pigment genes expression profile from different color goldfish skins, and this work enhanced our understanding of the pigmentation regulatory pathways in goldfish. www.nature.com/scientificreports/
RNA isolation, full-length cDNA library construction and PacBio sequencing. Total RNA was isolated from each sample using TRIzol reagent according to the manufacturer's manual (Thermo Fisher Scientific, USA). The total RNA samples were pooled equally from 13 tissues of one cyan-skin adult goldfish, and Ranchu larvae at one and three dph, and purified using oligo-dT magnetic beads. PolyA RNA was then reversely transcribed into cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit (TaKaRa Bio, Japan). Four fractions with different insert size (1-2 kb, 2-3 kb, 3-6 kb and 0.5-6 kb) were selected using the BluePippin system (Sage Science, Inc., USA). The resulting cDNAs were amplified by KAPA HiFi PCR Kits (Kapa Biosystems, Inc., USA). SMRTbell libraries were constructed using SMRTbell Template Prep Kit 1.0 (PacBio, USA). Twelve SMRT cells were used to sequence SMRTbell libraries (nine cells for the mixed libraries of 1-2 kb, 2-3 kb, 3-6 kb and three cells for the 0.5-6 kb library) on the PacBio Sequel system (Supplementary Table S12).
Illumina short-read library construction and sequencing. Sequencing library for each sample was constructed using a TruSeq RNA Library Preparation Kit (Illumina, USA) following the manufacturer's protocols. In brief, mRNA from each sample was purified from total RNA utilizing oligo-dT magnetic beads and fragmented with divalent cations. First strand cDNA was synthesized utilizing with M-MuLV Reverse Transcriptase (RNase H-) and random hexamer primers. Subsequently, DNA polymerase I, and RNase H were used to synthesize second cDNA strand. Through DNA end repair, a single nucleotide A was added at the 3′-end, and the DNA fragments were ligated to Illumina adaptors. To isolate DNA fragments of preferential ~ 300 bp in length, the library fragments were purified with AMPure XP system (Beckman Coulter, USA). PCR was then amplified using Phusion High-Fidelity DNA polymerase with Universal PCR primers and Index (X) Primers. Finally, the PCR fragments were purified with AMPure XP system again, and each library quality was assessed with the Agilent Bioanalyzer 2100 system (Agilent Technologies, Inc. USA) 46,47 .
The clustering for the index-coded samples was carried on a cBot Cluster Generation System by using the TruSeq PE Cluster Kit v3-cBot-HS (Illumina) according to the manufacturer's specifications. After cluster generation, the libraries were sequenced with the paired-end 150 bp reads model on an Illumina HiSeq X Ten platform 46,47 . PacBio data filtering and processing. PacBio Iso-Seq raw reads were classified and clustered into transcript consensus using the PacBio SMRT Link v5.1 pipeline (Fig. 1). After data filtering, including adapter removal and elimination of low-quality regions, the subreads were utilized to generate CCS (parameters: -min-Passes = 2, minPredicted accuracy = 0.8). The Iso-Seq classifying tool categorized the CCS into full-length nonchimeric (Flnc), non-full length non-chimeric (Nflnc) and chimeric reads (CR), based on cDNA primers and poly-A tail signal. Subsequently, the Flnc reads were clustered by Iterative Clustering for Error Correction (ICE) algorithm to generate the cluster consensus sequences 48 and the Nflnc sequences were used to polish the ICE consensus sequences by Arrow software (Fig. 1). The consensus sequences were further rectified by LoRDEC 49 software with Illumina RNA-Seq data (Fig. 1).
Gene structure analysis was performed using TAPIS v1.2.1 pipeline 15 . The GMAP output bam format file and GFF format genome annotation file were used for gene and isoform determination. The high-quality goldfish reference genome contains roughly 70,324 gene models and each gene model contains one transcript isoform 10 . The novel transcript isoforms which possessed different structures from those predicted in the reference genome were identified with the following criteria: 1) the novel transcript isoform contained different exon numbers compared to the reference transcript isoform, or 2) the novel transcript contained the same exon numbers as the reference transcript, but with at least 6-nt difference in the stop coordinate for the first exon, 6-nt difference in the start coordinate for the last exon, or 6-nt difference in the start or stop coordinate for the middle exons.
The new genes were identified by comparing to the reference genes with the following criteria: (1) the query contained no overlap or less than 20% overlapped sequences of the reference gene, or (2) the query contained more than 20% overlapped sequences of the reference gene, but the direction for transcription was different. The related results presented in Fig. 2 were drawn with ggplot 2 R package 51 .

Annotation analysis of new transcripts and genes. The unmapped transcripts and new genes
mapped onto reference genome were annotated with seven databases including NR, NT, Pfam, KOG, SwissProt, KEGG 53 and GO databases.
LncRNA identification and characterization. Long non-coding RNAs (lncRNAs) were identified with CNCI (Coding-Non-Coding-Index), PLEK, CPC (Coding Potential Calculator), and Pfam (Fig. 4a). Briefly, the protein-coding potential of the transcripts were predicted by CNCI and PLEK using the default parameters and blasted in NCBI eukaryotes' protein database using CPC and Pfam. The parameter for CPC was set at e-value of '1e-10′. Pfam hmmscan searching was performed with default parameters of -E 0.001 -domE 0.001. Transcripts with protein-coding potential by one or all aforementioned tools were filtered out, and those without proteincoding potential were candidates for lncRNAs. The related results presented in Fig. 4 were drawn with ggplot 2 R package 51 .
Global tissue transcriptomic analysis. The Illumina paired-end clean reads were mapped onto the goldfish reference genome using Hisat2 software 54 . The expression level of each isoform was estimated by Cuffdiff v2.1.1 55 , using the updated GFF file containing 87,946 genes from this study and the reference genome. The read numbers of different isoforms for each gene were counted and the gene expression level in FPKM (expected number of Fragments Per Kilobase of transcript sequence per Million base pairs sequenced) was calculated with HTSeq v0.9.1 56 . Gene expression cluster analysis was performed using pheatmap R package 1.0.12 with hierarchical clustering method of Euclidean distance 57 .

Skin-and color-specific transcriptomic analysis.
To identify genes significantly expressed in skin tissue, differentially expressed genes (DEGs) were compared between skin and other tissues with DEGseq R package (1.20.0) 23,58 . The resulting p-values were adjusted using the Benjamini & Hochberg's method to control the false discovery rate (FDR). Gene Ontology (GO) enrichment analysis of DEGs was performed via the GOseq R package of Bioconductor 3.5 59 . GO terms with FDR < 0.05 were identified to be significantly enriched for DEGs. The statistical enrichment (FDR ≤ 0.05) of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways 53 for DEGs was determined by KOBAS 3.0 software 60 .
To further identify the genes and pathways related to color-specific skin, we investigated gene expression profiles of four different color skins (black, cyan, red, and white) with RNA-Seq (Fig. 1). DEGs (FDR < 0.05) were compared between any two color skins with DEseq R package (1.18.0) 61 . GO enrichment and KEGG pathway analysis were also performed. The related results were drawn with ggplot 2 R package 51 .
Comparison of ohnolog genes located in two subgenomes. The gene expression FPKM is the sum of all isoform FPKM of the gene. Log2 FPKM are used for all statistic tests. Pairwise t test was performed using R stats package 62 with 'paired = T' . Entropy was computed using entropy R package version 1.2.1 63 . The isoform (or tissue) specificity for each gene is defined as max Ethics approval. The animal experiments and all procedures involved in the handling and treatment of goldfish in this study were approved by Shanghai Ocean University. All procedures conducted on fish were performed in accordance with relevant guidelines and regulations. All efforts were made to minimize animal suffering.

Data availability
The PacBio and Illumina data are publicly available at the NCBI/GenBank database under Bioproject PRJNA587054 and PRJNA580146, respectively. All raw data have been deposited in the NCBI Sequence Read Archive under the accession number SRR10382503-SRR10382505 (PacBio reads) and SRR10358004-SRR10358025 (Illumina reads). The updated GFF file of gene models, the nucleotide and protein sequences of unmapped transcripts, novel-protein-coding genes, and lncRNA genes and other related files were also deposited in Figshare (https ://figsh are.com/s/dc583 090d8 8df61 a640f ) 65 .