Background & Summary

The yellow-throated marten belongs to the genus Martes of the family Mustelidae and is named for the conspicuous yellow pelage on its chest and throat1. It is a voracious predator that feeds on a variety of vertebrates, invertebrates, fruit, nectar, and food residue2. Unlike many mustelids, the marten generally moves in groups of two to three individuals2, which increases access to resources and reduces the risk of predation3. Because it prefers forested areas and rarely appears in non-wooded environments4, it may serve as a good indicator of forest ecosystem health. As a top-level predator in certain ecosystems4, the marten performs several key roles in maintaining ecological balance, including dispersing seeds5 and controlling herbivore population sizes6. The marten faces a low risk of extinction and is classified as “Least Concern” by the International Union for Conservation of Nature (IUCN)7. However, rampant hunting, habitat loss, and other human activities pose a substantial threat to marten populations, which are gradually decreasing7. Fortunately, protective measures, including legislation to counter these trends, have been implemented in several countries, such as Myanmar8, Thailand9, South Korea6, and China10.

At present, research on the marten focuses primarily on its physical characteristics, behaviour, geographic range, and habitat. Molecular characterization has progressed slowly, but it has yielded the complete mitochondrial genome11,12, phylogenetic relationships among species inferred from mitochondrial and/or partial nuclear gene sequences13,14,15, and population genetic analyses based on microsatellite markers16,17. Nevertheless, genetic and evolutionary studies on the marten remain limited by the scarcity of available genomic resources. This gap matters because the marten is the only extant species of the genus Martes that is adapted for survival in areas spanning boreal to equatorial regions and from sea level to an altitude of 4,510 m (ref. 7). Populations occupying such different habitats are likely to differ genetically. Genome-level analyses of population structure and of the molecular mechanisms of adaptive evolution among marten populations would therefore be highly valuable. We applied Oxford Nanopore and Hi-C technologies to generate a chromosome-level genome assembly of the marten, which will serve as a useful resource for evolutionary and population genetics studies of this animal, as well as for studies of chromosome evolution in Carnivora.

Methods

Sampling and sequencing

The yellow-throated marten sample used for DNA and RNA sequencing was obtained from Chengdu, China. Muscle tissue was stored at −80 °C and used to construct Illumina, Nanopore, and Hi-C libraries. High molecular weight genomic DNA was extracted from muscle tissues using a Blood & Cell Culture DNA Midi Kit.

Short-insert-size (~400 bp) paired-end sequencing libraries were constructed using the TruSeq Nano DNA HT Sample Preparation Kit and sequenced on the Illumina HiSeq X Ten platform to generate 150 bp paired-end reads. This yielded 1.58 billion reads (236.83 Gb of raw sequence data), corresponding to 96.70-fold coverage of the genome assembly (Table 1, Table S1). Nanopore libraries were constructed and sequenced on the PromethION sequencer. In total, 27.76 million reads (264.40 Gb of raw sequence data) were obtained, corresponding to 107.96-fold coverage of the genome assembly (Table 1, Table S2). The mean read length and the read N50 were 9.53 kb and 17.43 kb, respectively, and the longest read was 204.65 kb (Table S2). Hi-C libraries were constructed using the MboI restriction enzyme and sequenced on the Illumina NovaSeq 6000 platform in 150 bp paired-end mode. As a result, 257.31 Gb of Hi-C reads were obtained, corresponding to 105.07-fold coverage of the genome assembly (Table 1, Table S3).
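The fold-coverage figures above follow from dividing the raw data volume by the assembly size. The snippet below is a minimal sketch of that arithmetic, assuming 1 Gb = 1,000 Mb and using the assembly size of 2,449.15 Mb reported in the assembly section; the names `RAW_DATA_GB`, `ASSEMBLY_SIZE_MB`, and `fold_coverage` are illustrative and not part of the original pipeline.

```python
# Raw data volumes in Gb, as reported in the text.
RAW_DATA_GB = {
    "Illumina": 236.83,
    "Nanopore": 264.40,
    "Hi-C": 257.31,
}

# Size of the final assembly in Mb, as reported in the assembly section.
ASSEMBLY_SIZE_MB = 2449.15


def fold_coverage(data_gb: float, genome_mb: float) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    # Assumption: 1 Gb = 1,000 Mb.
    return (data_gb * 1000.0) / genome_mb


coverage = {
    name: fold_coverage(gb, ASSEMBLY_SIZE_MB) for name, gb in RAW_DATA_GB.items()
}

for name, depth in coverage.items():
    print(f"{name}: {depth:.2f}-fold")
```

Because the inputs are themselves rounded to two decimals, the recomputed values can differ from the reported ones in the last digit.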

Table 1 Statistics of sequencing data generated in this study.

Additionally, RNA was extracted from seven tissues of the marten: testis, stomach, kidney, pancreas, heart, spleen, and intestine. Transcriptome sequencing was performed on the Illumina NovaSeq 6000 platform, which yielded a total of 60.43 Gb of raw reads (Table 1, Table S4).

Genome size and heterozygosity estimation

Raw genomic Illumina sequencing reads were filtered using Fastp v0.12.6 (ref. 18) to remove adapters, duplicate reads, and low-quality reads. The clean reads were then used to estimate genome size, heterozygosity, and repeat content from the 21-mer frequency distribution using Jellyfish v2.3.0 (ref. 19) and GenomeScope v2.0 (ref. 20). This analysis identified 205,236,235,649 21-mers with a depth of 77 (Table S5). We therefore estimated that the genome of the marten is approximately 2,224.23 Mb in size, with a heterozygosity of 0.40% and a repeat content of 13.16% (Fig. 1, Table S5).
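To illustrate what a k-mer frequency distribution is, the following toy sketch counts k-mers in a handful of reads and tabulates how many distinct k-mers occur at each depth. It is not how Jellyfish is implemented and would not scale to real data; counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption made here, and the function names are hypothetical.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map each depth to the number of distinct canonical k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    # Histogram: depth -> number of distinct k-mers with that depth.
    return Counter(counts.values())
```

A histogram of this kind, computed over all clean reads with k = 21, is the input that a model-fitting tool such as GenomeScope uses to estimate genome size, heterozygosity, and repeat content.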

Fig. 1

The 21-mer frequency distribution analysis for the marten genome based on Illumina paired-end reads. The observed 21-mer frequency distribution is shown in blue, whereas the fitted model is shown as a black line. The unique and putative error k-mer distributions are plotted in yellow and red, respectively.

De novo assembly of the marten genome

Sequencing data generated on the Nanopore platform were corrected (parameters: “reads_cutoff: 1k,seed_cutoff: 19k”) and assembled (parameters: default) using NextDenovo v2.0-beta.1 (https://github.com/Nextomics/NextDenovo). The accuracy of the assembly was further improved by four rounds of self-correction and three rounds of consensus correction using ONT reads and Illumina short reads with Nextpolish v1.0.5 (ref. 21). The final assembly was 2,449.15 Mb in size, comprising 215 contigs with a contig N50 of 68.60 Mb (Table 2). This size is close to that of the genome of Martes zibellina (2,420.68 Mb), a closely related species of the marten22. Further genome assembly summary statistics were computed using Gfastats v1.3.3 (ref. 23) (Table 2).
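For readers unfamiliar with the N50 statistic quoted for both reads and contigs, the sketch below shows the standard definition: the length L such that sequences of length at least L together contain at least half of all bases. It is a generic illustration with a hypothetical function name, not the code used by Gfastats.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

Applied to the contig lengths of an assembly, this returns the contig N50; applied to read lengths, it returns the read N50.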

Table 2 Summary statistics of the genome assembly.

Chromosomal-level scaffolding

Chromosome-level scaffolding was performed by Hi-C analysis at the Genome Center of Grandomics (Wuhan, China). The raw Hi-C data were first filtered using HiC-Pro v2.8.0 (ref. 24). After quality control with Fastp, the clean Hi-C data were mapped to the marten genome assembly using Bowtie2 v2.3.2 (ref. 25) to obtain uniquely mapped paired-end reads. In total, 608.63 million uniquely mapped paired-end reads were obtained (Table S6), of which 83.19% were valid interaction pairs (Table S7). Using the valid Hi-C data, LACHESIS24 was applied to produce a chromosome-level genome. We then manually corrected misassembled contigs using Juicebox26, based on the interaction strength among the contigs and a linkage map. In total, 2,419.20 Mb (98.78%) of the assembled sequences were anchored and oriented onto 21 chromosomes, ranging from 3.97 Mb to 219.65 Mb in length (Fig. 2, Table 3). The R package ggplot2 was then used to generate a genome-wide Hi-C heatmap to evaluate the quality of the chromosome-level genome. The heatmap of chromosome crosstalk indicated that the chromosome-level genome was complete and robust (Fig. 3).

Fig. 2

Features of the marten genome. The tracks from outside to inside show the 21 chromosomes, repeat sequence abundance (blue), GC content (purple), gene abundance (red), and collinear regions (each line connects a pair of homologous genes). The Circos plot was generated using TBtools59.

Table 3 Statistics of the chromosomal-level genome.
Fig. 3

Genome-wide all-by-all Hi-C interaction among 21 chromosomes of the marten. The heatmap indicates that intra-chromosome interactions (blocks on the diagonal line) are stronger than inter-chromosome interactions. The shading gradient on the right represents the intensity of chromosomal interactions, which ranges from white (low) to red (high).

Genome quality assessment

Complementary methods were employed to evaluate the quality of the genome assembly. First, the Illumina reads and Nanopore reads were aligned to the marten genome using BWA v0.7.12-r1039 (ref. 27) and Minimap2 v2.17 (ref. 28), respectively. The results showed that 99.85% of the Illumina reads and 99.74% of the Nanopore reads could be mapped to the genome, with coverage rates of 99.87% and almost 100%, respectively (Table S8, Table S9). Second, the completeness of the genome was evaluated using BUSCO v4.0.5 with the mammalia_odb10 core gene set (ref. 29). Of the 9,226 BUSCO genes, 8,614 (93.37%) were identified as complete, of which 8,583 (93.03%) were single copy and 31 (0.34%) were duplicated (Fig. S1). Third, Merqury v1.3 (ref. 30) was used to assess the consensus quality value (QV) and k-mer completeness of the genome assembly, which were 43.75 and 94.95%, respectively (Table S10, Fig. S2).
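The BUSCO percentages can be re-derived from the reported counts. The following is a minimal consistency check using only the numbers given above; the variable and function names are illustrative.

```python
# Counts reported for the mammalia_odb10 gene set.
TOTAL_BUSCOS = 9226
SINGLE_COPY = 8583
DUPLICATED = 31

# Complete BUSCOs are the sum of single-copy and duplicated ones.
complete = SINGLE_COPY + DUPLICATED


def pct(count: int, total: int = TOTAL_BUSCOS) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


print(complete, pct(complete), pct(SINGLE_COPY), pct(DUPLICATED))
```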

Repeat annotation

Homology-based and ab initio prediction methods were used to identify repetitive sequences in the marten genome. The homology-based analysis was performed using RepeatMasker v4.1.0 (ref. 31) with the Repbase database32. For ab initio prediction, RepeatModeler v2.0.1 (ref. 33) was used to construct a de novo repeat library, which was subsequently used to predict repeats with RepeatMasker. We identified 973.18 Mb of repetitive sequences, accounting for 39.74% of the marten genome (Table 4). Among these, long interspersed elements (LINEs) were the most abundant, accounting for 26.13% of the whole genome (Table 4). These results are consistent with findings in published mustelid genomes22,34,35.

Table 4 Statistics of repetitive elements in the marten genome.

Prediction and functional annotation of protein-coding genes

We predicted protein-coding genes in the marten genome by integrating three strategies: ab initio prediction, homology-based prediction, and transcriptome-based prediction. First, Augustus v2.5.5 (ref. 36), GlimmerHMM v3.0.4 (ref. 37), Geneid v1.4.4 (ref. 38), and Genscan v1.0 (ref. 39) were used for ab initio gene prediction with internal gene models. Second, protein sequences of seven species (Bos taurus, Canis lupus familiaris, Enhydra lutris, Homo sapiens, Mustela erminea, Mustela putorius furo, and Mus musculus) were downloaded as templates for homology-based prediction and aligned against the marten genome using TblastN v2.2.26 (ref. 40) with an E-value ≤ 1e−5. The potential gene structure of each alignment was then predicted by GeneWise v2.4.1 (ref. 41). Third, transcriptome data were aligned to the marten genome with TopHat v2.1.1 (ref. 42) and the gene structures were predicted by Cufflinks v2.2.1 (ref. 43). Finally, a non-redundant gene set was generated by integrating the three annotation sets in EVidenceModeler v1.1.1, with different weights assigned to each (ab initio prediction, “1”; homology-based prediction, “5”; transcriptome-based prediction, “10”). PASA v2.3.3 was used to update the gene models by identifying untranslated regions to generate a final annotation44.
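EVidenceModeler reads its weights from a three-column file giving the evidence class, the source label, and the weight. The sketch below renders such a file with the weights stated above (1, 5, and 10). The source labels in the second column are hypothetical placeholders: in practice they must match the source column of the GFF3 files supplied to EVidenceModeler, which is not specified in the text.

```python
# (evidence class, source label, weight)
# Evidence classes are those recognised by EVidenceModeler;
# the source labels are assumed names, not taken from the original study.
WEIGHTS = [
    ("ABINITIO_PREDICTION", "augustus", 1),
    ("ABINITIO_PREDICTION", "glimmerhmm", 1),
    ("ABINITIO_PREDICTION", "geneid", 1),
    ("ABINITIO_PREDICTION", "genscan", 1),
    ("PROTEIN", "genewise", 5),
    ("TRANSCRIPT", "cufflinks", 10),
]


def render_weights(rows) -> str:
    """Render rows as the tab-separated text of a weights file."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in rows)


weights_text = render_weights(WEIGHTS)
print(weights_text, end="")
```

The higher weight given to transcript evidence means that, where the evidence types disagree, the consensus gene structure is pulled toward the RNA-seq-supported model.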

Functional annotation of the protein-coding genes was performed using eggNOG-Mapper v2 (ref. 45), a tool that enables rapid functional annotation of novel sequences on the basis of pre-computed orthology assignments, against the eggNOG v5.0 database46.

Overall, we obtained 20,464 protein-coding genes in the marten genome, of which 20,322 (99.31%) were successfully annotated. Additionally, we compared the distributions of mRNA length, coding DNA sequence (CDS) length, exon length, intron length, and exon number in the marten genome with those of seven other mustelids: Enhydra lutris, Lontra canadensis, Lutra lutra, Mustela erminea, Meles meles, Mustela putorius furo, and Neovison vison (Table 5, Fig. 4). The results revealed a higher percentage of short mRNAs in the marten genome than in the genomes of the seven other mustelids (Fig. 4a). Further, short introns (about 0–75 bp) in the marten genome had a distribution pattern distinct from that of the seven other mustelids (Fig. 4d). One possible reason is that genome assembly and/or annotation results deviate slightly between species.

Table 5 The comparisons of gene elements in the marten genome with seven other mustelids.
Fig. 4

The comparisons of gene elements in the marten genome with seven other mustelids. (a) mRNA length distribution and comparison with seven other mustelids. (b) CDS length distribution and comparison with seven other mustelids. (c) Exon length distribution and comparison with seven other mustelids. (d) Intron length distribution and comparison with seven other mustelids. (e) Exon number distribution and comparison with seven other mustelids.

Data Records

The genomic Illumina sequencing data were deposited in the NCBI Sequence Read Archive under accession number SRR21452075 (ref. 47). The genomic Nanopore sequencing data were deposited in the NCBI Sequence Read Archive under accession number SRR21426791 (ref. 48). The transcriptome Illumina sequencing data were deposited in the NCBI Sequence Read Archive under accession numbers SRR21460068-SRR21460074 (refs. 49,50,51,52,53,54,55). The Hi-C sequencing data were deposited in the NCBI Sequence Read Archive under accession number SRR21430408 (ref. 56). The final chromosome assembly was deposited in GenBank at NCBI under accession number JAODOS000000000 (ref. 57). The final chromosome assembly, gene structure annotation, repeat annotation, and gene functional prediction were deposited in the Figshare database58.

Technical Validation

DNA quantification and qualification

DNA degradation and contamination were monitored on 1% agarose gels. DNA purity was assessed using a NanoDrop One UV-Vis spectrophotometer. DNA concentration was measured using a Qubit Fluorometer.

RNA quantification and qualification

RNA degradation and contamination were monitored on 1% agarose gels. RNA concentration was measured using a Qubit Fluorometer. RNA integrity was assessed using an Agilent 2100 Bioanalyzer.

Quality filtering of Illumina data

To ensure that the reads were reliable for the subsequent analyses, we used Fastp to improve the quality of the raw reads generated on the Illumina platform. The data were filtered as follows:

(1) removing reads with more than 10% of Ns;
(2) removing reads with a quality score less than 20 for 20% of bases;
(3) removing reads with adapter sequences;
(4) removing duplicate reads.
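The four criteria above can be expressed as a simple per-read filter. The sketch below is an illustration of the stated rules only; it is not Fastp's implementation and does not reproduce its options. Several details are assumptions not given in the text: qualities are taken to be Phred+33 encoded, a read is discarded when at least 20% of its bases fall below Q20, the adapter is represented by a common Illumina adapter prefix, and duplicates are detected by exact sequence identity. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    name: str
    seq: str
    qual: str  # assumed Phred+33 encoded quality string


MAX_N_FRACTION = 0.10            # criterion (1)
MIN_BASE_QUALITY = 20            # criterion (2)
MAX_LOW_QUALITY_FRACTION = 0.20  # criterion (2), boundary handling is assumed
ADAPTER = "AGATCGGAAGAGC"        # assumed adapter prefix, criterion (3)


def passes_filters(read: Read, adapter: str = ADAPTER) -> bool:
    """Apply criteria (1) to (3) to a single read."""
    length = len(read.seq)
    if length == 0:
        return False
    seq = read.seq.upper()
    # (1) too many ambiguous bases
    if seq.count("N") / length > MAX_N_FRACTION:
        return False
    # (2) too many low-quality bases
    low = sum(1 for char in read.qual if ord(char) - 33 < MIN_BASE_QUALITY)
    if low / length >= MAX_LOW_QUALITY_FRACTION:
        return False
    # (3) adapter sequence present
    if adapter in seq:
        return False
    return True


def filter_reads(reads):
    """Return reads passing criteria (1) to (3), keeping one copy of each sequence."""
    seen = set()
    kept = []
    for read in reads:
        if not passes_filters(read):
            continue
        # (4) drop exact duplicates of an already kept sequence
        if read.seq in seen:
            continue
        seen.add(read.seq)
        kept.append(read)
    return kept
```

For paired-end data, a real pipeline would evaluate both mates together and discard the pair if either mate fails; this sketch treats reads independently for brevity.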