Background & Summary

The order Acipenseriformes comprises 27 extant species of sturgeons, which are considered “living fossils” with a fossil record dating back to the upper Cretaceous1,2. In terms of scientific value, these chondrostean species preserve many transition traits between Chondrichthyes and Osteichthyes, occupying a vital evolutionary position. Sturgeons include three degrees of polyploidy, including ~120, ~240 and ~360 chromosomes, which might result from multiple and independent duplication events3,4. Sturgeons are important species for studying vertebrate evolution and whole genome duplication. Regarding economic value, black caviar is very popular in the European market, and sturgeon meat is also an important food resource5. Unfortunately, most sturgeons are currently endangered6.

Chinese sturgeon (Acipenser sinensis) is a large anadromous fish present in the main stream of the Yangtze River and East China Sea in China5,7. At present, Chinese sturgeon is critically endangered, which is mainly attributed to the following reasons: (i) asynchronous and delayed sexual maturity between male and female Chinese sturgeon (8–18 years for male and 14–26 years for female)6; and (ii) human activities, such as overfishing, damming, shipping and pollution7. The sex ratio (female:male) of wild Chinese sturgeon has gradually increased from 1.1:1 (1981–1983) to 5.86:1 (2003–2004)8. Therefore, artificial propagation and selective release are considered efficient approaches to recover the wild population size and adjust the sex ratio of the wild population of Chinese sturgeon. However, the sexes of A. sinensis cannot be distinguished due to a lack of secondary sexual characteristics and molecular markers of sex identification. Moreover, whether sex differentiation of A. sinensis is determined by genetics, the environment or a combination of both remains unclear.

Sturgeon genomes are complicated due to large genome sizes (1.6–9.32 pg/C), many chromosomes (consisting of macro- and micro-chromosomes) and controversial ploidy4,9,10. To date, no sturgeon reference genome is available. Transcriptome sequencing is very useful for identifying novel genes, checking gene activity, revealing genic functions, and exploring the molecular mechanisms of development and sex differentiation of sturgeon species11,12. Transcriptome analyses regarding sex determination and development of several sturgeons, including A. naccarii13, A. schrenckii14,15,16,17,18, A. sinensis6,19,20, A. gueldenstaedtii21, A. fulvescens11, A. baerii22, and A. dabryanus23,24, have been reported. Transcriptomes between male and female A. sinensis gonads were analysed in a previous study6 where some differential genes between the ovary and testis were predicted, and potential gametogenesis-related genes were screened out. However, identifying the sex determination mechanism of A. sinensis only by analysing genes in gonad tissues is not possible. The regulation of multiple tissues, signals and genes must be considered to unveil the mechanism of sex differentiation and discover sexual markers.

The reproductive process in vertebrates relies on multiple regulatory factors of the hypothalamus-pituitary-gonad (HPG) axis25,26. Gonadal differentiation, development, and maturation in fish are also regulated by many kinds of hormones, such as gonadotropin-releasing hormones, gonadotropin hormones, steroid hormones and other relative hormones, in the HPG axis14,27,28. Recent studies on the HPG axis of A. schrenckii focused only on the genes in the HPG axis from brain tissue by kisspeptin treatment14. However, obtaining transcriptome profiles of whole HPG tissues in sturgeon is necessary to better understand the reproductive mechanisms.

Here, the HPG axis transcriptomes of A. sinensis were sequenced using Illumina HiSeq 2000 and 4000 platforms. A total of 271.19 Gb filtered data from 45 samples belonging to 15 individuals (six individuals in stage I, six males and three females in stage II) were obtained, and 74.50% of 121,952 assembled unigenes were annotated to the NCBI non-redundant protein database (NR), Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Cluster of Orthologous Groups of proteins (COG), InterPro, Gene Ontology (GO) and NT databases. Additionally, 78,770 unigenes contained complete open reading frames (ORFs). A total of 51,570 simple sequence repeats (SSRs) were detected. We identified 1,531 to 34,439 differentially expressed genes (DEGs) in 18 pairwise comparisons of HPG samples of male and female A. sinensis in two stages. We reported the first integrated transcriptome data from HPG samples of male and female A. sinensis in stage I and II of gonadal development. These data offer valuable resources for research into reproductive regulation and the HPG axis interaction in A. sinensis and other sturgeons.

Methods

Ethics statement

The experiments were performed in accordance with the guidelines and regulations of the National Institute of Health Guide for the Care and Use of Laboratory Animals and were approved by the Institutional Review Board on Bioethics and Biosafety of the Chinese Sturgeon Research Institute.

Sample preparation

Individuals of A. sinensis used here were selected from the artificial breeding population of the Chinese Sturgeon Research Institute, China Three Gorges Corporation (Yichang, China). The stages of gonadal development were classified following previously described methods29. The sex cannot be distinguished by a histochemical assay for gonads in stage I, while it can be easily distinguished in stage II in A. sinensis. Herein, 6 individuals (1 year old) in stage I and 9 individuals (6 males and 3 females over 4 years old) in stage II were selected. All experimental individuals were first euthanized with M222. Three kinds of tissues (hypothalamus (H), pituitary (P) and gonad (G)) were collected from each individual, flash frozen in liquid nitrogen and stored at −80 °C for RNA extraction. A total of 45 samples from 15 individuals (three tissues for each individual) were used for cDNA library construction.

Illumina sequencing and data processing

Total RNA was extracted with TRIzol reagent (Invitrogen) following the manufacturer’s instructions. The RNA purity, integrity and concentration were calculated and checked by an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA), and high-quality RNA (RNA Integrity Number > 7.0) was used for Illumina sequencing. Total RNA samples (5 μg) were then subjected to cDNA construction following Illumina truSeq stranded mRNA sample preparation protocol. Next, 45 libraries were sequenced using Illumina HiSeqTM 2000/4000 with 2 × 90 or 100 bp paired-end sequencing. To obtain high-quality reads, the raw reads were filtered by SOAPnuke (v1.5.6)30 with the default parameters except “-l 20 -q 0.2 –M 3” according to the following criteria: (i) reads containing adaptors (adapters of more than 15 bases matched to reads with maximal 3-base mismatches allowed) were discarded, (ii) reads with a high proportion (> 5%) of unknown nucleotides (N) were removed, and (iii) reads with ≥ 20% bases Q ≤ 20 were removed. Here, the filtering parameters were stricter than the default parameters, and reads containing low quality bases or adapters were removed entirely, rather than being trimmed.

Transcriptome assembly and annotation

All data obtained from the 45 libraries were assembled by the Trinity program (version: release-2013-08-14), including Inchworm, Chrysalis and Butterfly modules, with default parameters except “–path_reinforcement_distance 85 –min_kmer_cov 3”31. To minimize redundancy, we clustered transcripts using TGICL32 and the non-redundant sequences of >200 bp were retained. Finally, the longest sequence was preserved and designated as a unigene. Following the filtering methods of the unigene set of Andrias davidianus33, we used a strict pipeline to filter these sequences with some modification to reduce the background and assembly errors. There are some differences compared with the published filtered cut-offs33. Here, if the lengths of unigenes were in the range of 200–500 bp with the fragments per kilobase per million mapped fragments (FPKM) < 50 in just one sample or FPKM < 1 in ten samples, they were removed. Several different public protein databases were used to validate and annotate the assembled unigenes for assigning gene names, coding sequences (CDS) and predicting protein annotations. The sequence-based alignments were mapped against the NCBI NR protein database, Swiss-Prot protein database, KEGG and COG using the BLASTx algorithm34,35 with an E-value threshold of 1e−5. The priority order of NR, Swiss-Prot, KEGG, and COG was set. The unigenes that did not align to any of the above databases were predicted as ORFs using ESTScan software36. To further evaluate the different sample mapping rates and the quality of the assembled unigenes, we aligned all tissue samples with high-quality reads to the unigenes using SOAP2.21, allowing up to 5 base mismatches37.

To annotate and categorize A. sinensis unigene function, we conducted NR annotation, and GO analysis. The NR animal protein sequences came from dozens of species; based on NR annotation, unigenes were assigned to GO classes using BLAST2GO38. With WEGO software39, the assigned GO terms were summarized into the three main GO categories, including biological process, molecular function and cellular component. Overall, all the unigenes were assigned COG classifications based on all the unigenes blasted against the COG database. The COG-annotated putative proteins were classified into 25 functional categories. To assess the quality of the transcriptome gene set, we evaluated the completeness of coding gene set using Benchmarking Universal Single-Copy Orthologs (BUSCO)40.

Differentially expressed gene analysis and RT-qPCR validation

To obtain the gene expression information, we mapped the high-quality reads of each sample to all transcripts using SOAP2 software41. After counting the number of mapped reads, estimated FPKM values were calculated through RSEM based on the mapping results. Principal component analysis (PCA) was performed for the 45 samples with the FPKM by the princomp function in the R package42. The 45 samples from the HPG in two stages were designed for 18 pairwise comparisons. The DEGs of these pairwise comparisons were evaluated by Noiseq software with probability no less than 0.843. The GO annotation methods are similar to those mentioned above in unigene annotations. Compared to the whole genome background, the DEG pathway enrichment analysis was identified by a significantly enriched pathway. After multiple testing for the p-value, pathways with a q-value ≤ 0.05 were defined as significantly enriched pathways.

To validate the DEGs, we performed RT-qPCR to evaluate the sequencing and data analysis. Total RNA was extracted using TRIzol reagent (Ambion) according to the manufacturer’s protocol. Approximately 1 μg of RNA was converted to cDNA. Then, the cDNA templates were reverse-transcribed with a PrimeScriptTM RT reagent kit with gDNA Eraser (Perfect Real Time) (TakaRa, Dalian, China). RT-qPCRs were performed on a My IQTM colour Real-time PCR Detection System (Bio-Rad, USA). Using β-actin as a reference gene normalized by the median expression, the relative expression levels of 10 target sex-related genes were calculated by the 2−ΔΔCt method44. RT-qPCR was performed on three biological replicates, and the data are presented as the mean ± SD. With Student's t-test, the difference was considered statistically significant at a p-value < 0.05.

Microsatellite markers

SSRs were detected in unigenes by MIcroSAtellite (MISA; http://pgrc.ipk-gatersleben.de/misa, version 1.0) software. The following types of SSRs were detected: mono-, di-, tri-, tetra-, penta- and hexa-nucleotide repeats, as well as compound SSRs [the sequence including more than one type of repeat units, e.g., (GA)n(TC)m]. Here, the standards of the repeat unit numbers were set as follows: mono-10, dimer-6, trimer-5, tetramer-5, pentamer-5, and hexamer-5.

Data Records

All raw reads of transcriptome sequences have been submitted to the NCBI Sequence Read Archive45. The assembled transcriptome data were deposited in NCBI’s GenBank46. The gene expression measurements were deposited in Gene Expression Omnibus (GEO)47. The transcriptome annotation of unigenes in the NR, NT, SwissProt, KEGG, COG, InterPro and GO database, differentially expressed genes in different comparisons, RT-qPCR test results and statistic of SSR motifs of the Acipenser sinensis unigenes can be found in our Figshare records48.

Technical Validation

A total of 3,255.19 M raw reads were generated by Illumina HiSeq 2000 and 4000 platforms. After filtering raw reads, 2,967.34 M high-quality reads (271.19 Gb high-quality data) were used for the transcriptome assembly, and all 45 samples were used for differential expression analysis. Sample sequencing statistics and high-quality data information are presented in Table 1. The statistics of the assembled unigene set of the 45 samples are listed in Online-only Table 1; the total number ranged from 40,374 to 155,858. The mean length and N50 length of the 121,952 unigenes were 1,384 bp and 2,208 bp, respectively. To annotate and categorize A. sinensis unigene function, we conducted NR annotation and GO analysis. A total of 76,268 unigenes matched the NR animal species dataset with BLASTx. Then, 53.86% of the unigenes were most similar to the genes of Lepisosteus oculatus, followed by Latimeria chalumnae (3.58%), Salmo salar (3.22%), Scleropages formosus (2.21%), Danio rerio (2.07%), Astyanax mexicanus (1.76%) and Clupea harengus (1.69%) (Fig. 1a).

Table 1 Raw data, clean data, quality and GC content of 45 transcriptomes from the Acipenser sinensis hypothalamus-pituitary-gonad (HPG) axis in two sex development stages.
Fig. 1
figure 1

Characteristics of Acipenser sinensis unigenes with a BLASTx search against databases. (a) Species distribution of the best BLASTx matches for each unigene. (b) The annotation of the Acipenser sinensis unigenes in the NR, InterPro, Swiss-Prot, COG and KEGG databases.

A total of 74.5% of unigenes (90,853) were aligned to 7 databases (NR, Swiss-Prot, KEGG, COG, InterPro, GO and NT). A total of 76,268 unigenes were annotated to NR, 78,088 to NT, 67,964 to Swiss-Prot, 63,920 to KEGG, 30,194 to COG, 59,417 to InterPro and 53,038 to the GO database (Fig. 1b). For the functional annotation and classification analyses, with the annotation data based on the classification by GO, the cellular process, the cell and cell part category were the most frequent GO classification groups (Fig. 2). In COG analysis, 30,194 unigenes were annotated and classified into 25 functional categories (Supplementary Fig. 1). The largest cluster was “the general function prediction only”, followed by “Cell wall/membrane/envelope biogenesis” and “Signal transduction mechanisms”. Signal transduction; Cancers: Overview, Global and overview maps; Endocrine system; and Immune system were the top five annotated pathways according to KEGG classification analysis (Supplementary Fig. 2).

Fig. 2
figure 2

Gene Ontology classification of the Acipenser sinensis transcriptome. A total of 49,015 unigenes with BLASTx matched against the animal NR database were classified into three main GO categories (biological process, cellular component, molecular function) and 58 sub-categories. The left-hand scale on the y-axis shows the detailed names of the sub-categories. The x-axis indicates the number of unigenes in the same category.

In BUSCO annotation analysis, the total number of Actinopterygii genes for evaluation was 4,584, and 85.7% of the total BUSCOs identified. In this evaluation data, the ‘Complete and single-copy BUSCOs’ was 42.5%, and the ‘Complete and duplicated BUSCOs’ was 43.2%.

Based on the gene expression data, PCA was performed for dimension reduction to extract a small number of representative features that can represent the effects of all genes. Herein, the PCA-based method on these data set succeeded in discriminating hypothalamus (H), pituitary (P) and gonad (G) from each other (Fig. 3a). However, the female sample displayed significant differences among replicates, especially for the female hypothalamus, possibly because the female samples came from different families. Then, Noiseq was used to calculate differential expression of two kinds of comparison purposes (18 pairwise comparisons including 9 in HPG and 9 in different sexes). Approximately 74.5–85.6% of the short reads for each individual were mapped to the final unigenes, indicating the good quality of the assembly results. In the female hypothalamus versus female pituitary (FH-VS-FP), 865 genes were upregulated, while 666 genes were downregulated. In the female pituitary versus female gonad (FP-VS-FG), 1,091 genes were upregulated, while 2,857 genes were downregulated. In the female hypothalamus versus female gonad (FH-VS-FG), 1,456 genes were upregulated, while 3,097 genes were downregulated. The number of DEGs greatly increased in the order of FH-VS-FP → FP-VS-FG → FH-VS-FG. A similar trend of gene expression in HPG also occurred in both males and individuals whose sexes could not be distinguished in stage I. Of the differentially expressed unigenes between males and females in the same kinds of tissues, 772 genes were upregulated, while 1,077 genes were downregulated in the unknown-sex hypothalamus versus male hypothalamus (UH-VS-MH); 3,696 genes were upregulated, while 13,685 genes were downregulated in the unknown-sex hypothalamus versus female hypothalamus (UH-VS-FH); and 4,981 genes were upregulated, while 17,211 genes were downregulated in the male hypothalamus versus female hypothalamus (MH-VS-FH). In addition to the comparisons above, the total results of pairwise comparisons of 9 kinds of tissues are described in Fig. 3b. To validate the repeatability and reproducibility of DEGs generated from annotated data, we re-amplified 10 sex-related genes by RT-PCR. Nine of the 10 genes had gene expression profiles similar to those from RNA-Seq, suggesting that the sex-related genes identified through RNA-Seq were highly accurate and reliable.

Fig. 3
figure 3

Principle component analysis (PCA) and different gene expression analysis in the Acipenser sinensis transcriptome. (a) PCA plot of 45 samples based on estimated expression values. (b) The statistics of different expression genes in different pairwise comparisons. A total of 18 pairwise comparisons among the HPG of different sexes were performed. (F: Female; M: Male; U: unknown; H: hypothalamus; P: pituitary; G: gonad).

The unigene set generated a total of 51,570 SSRs, with the largest number of SSR motifs of mono-nucleotide repeats (18,162), followed by di-nucleotide repeats (17,628) and tri-nucleotide repeats (12,799) (Fig. 4).

Fig. 4
figure 4

SSR distribution of Acipenser sinensis unigenes.

Usage Notes

For the first time, we reported the transcriptome resources of the integrated HPG of stage I and stage II in A. sinensis. The novel de novo assembly unigenes will provide the following: (i) abundant functional genes for further research on molecular evolution, genetic breeding and fish disease prevention; and (ii) an abundance of SSR markers in A. sinensis. In addition, the DEGs will offer possibilities (i) to understand the gene interaction and regulations of gonad development from stage I to II in A. sinensis, (ii) to explore the gene co-expression and sex differentiation in A. sinensis, and (iii) to reveal gene or tissue interaction models of the HPG axis in A. sinensis. These findings are also valuable for research on developmental regulation or gene expression patterns of tissue interaction, the identification of HPG- and sex-related genes, and developing SSR markers in other sturgeon species.