Background & Summary

In recent decades, there has been growing interest in the body size of proboscideans, as it is closely associated with a variety of biological functions due to its high correlation with mass1. Currently, there are two extant genera within Proboscidea, comprising three species: the Asian elephant (Elephas maximus), the African savannah elephant (Loxodonta africana), and the African forest elephant (Loxodonta cyclotis). Proboscidean populations have been decreasing rapidly due to factors such as poaching and hunting. As a result, they are now classified as endangered or critically endangered on the IUCN Red List (https://www.iucnredlist.org/). Demand for ivory has also driven some unusual evolutionary changes in elephants, such as a substantial increase in the proportion of tuskless female African elephants and a gradual decrease in tusk size in male African elephants2. In addition, the swift expansion of cash-crop cultivation has led to habitat fragmentation, which has emerged as a significant threat to wild populations3. An increasing number of elephants are leaving the forest and frequently venturing into villages and residential areas. As a result, there have been occasional occurrences of crop damage, as well as harm to humans and animals. The escalating human-elephant conflict poses a significant challenge to conservation efforts and is detrimental to the healthy development of elephant populations. Additionally, changes in the populations of large mammals have a comparatively large impact on other animals within their habitat. Therefore, the protection and conservation of elephants has become a focus of biodiversity conservation efforts.
In the era of transitioning from conservation genetics to conservation genomics4,5,6,7, a high-quality reference genome is vital for improving the evaluation of the full spectrum of genomic diversity, inbreeding and outbreeding depression, local adaptation and genetic load8,9,10,11. Furthermore, such a genome assembly provides a valuable resource for studying the ecology and evolution of the species concerned12,13.

Rapid advances in high-throughput sequencing technologies over the past decade have opened new avenues for addressing the genetic basis of adaptation and speciation in natural populations14. Genetic data have proven valuable in delineating taxa that cannot be identified based on morphology alone15,16,17. In the case of endangered animals, haplotype analysis can help detect hidden signals of inbreeding depression, providing crucial insights for conservation initiatives18. Therefore, obtaining high-quality elephant genomes will be important for elucidating the genetic mechanisms underlying the species’ distinct biological characteristics and complexity, as well as for informing conservation strategies aimed at preserving these species. Although draft genomes of the two elephants have been released previously19,20, recent HiFi sequencing technology greatly improves genome quality and makes haplotype-resolved reference genomes possible20,21,22.

In this study, we generated chromosome-level, haplotype-resolved genome assemblies of the Asian elephant and the African savannah elephant using PacBio HiFi long reads, DNBSEQ short reads, and Hi-C sequencing data. The assembled genome sizes were 3.38 Gb and 3.31 Gb for the Asian elephant and African savannah elephant, with scaffold N50 lengths of 130 Mb and 122 Mb, respectively. These results are a substantial improvement over the published genomes14,15. Approximately 97% of the assembled sequences were anchored to 29 pseudochromosomes. The collinearity analysis of the chromosome-level genomes of the two species is consistent with the results of published karyotype studies23, which supports the accuracy of the genome assemblies in this study. Using a combination of de novo prediction, homology-based search, and transcriptome-assisted methods, we annotated 22,177 and 22,142 protein-coding genes in the genomes of the Asian elephant and African savannah elephant, respectively. Additionally, we identified ~9 Mb of Y-linked sequences in each of the two elephant genomes by combining evidence from the sex-determining region of the Y chromosome (SRY) and chromosomal synteny. The two high-quality elephant reference genomes produced in this study are a valuable resource for future research on the ecology, evolution, biology, and conservation of Proboscidea species. They can support work on a diverse array of subjects, offering an opportunity to improve our understanding of these animals and to strengthen efforts for their conservation.

Methods

Sample collection and ethics statement

Blood samples from E. maximus and tissue samples from L. africana were provided by the Asian Elephant Research Center of the National Forestry and Grassland Administration of China and Harbin North Forest Zoo, Heilongjiang Province, China. A portion of each fresh sample (blood from the Asian elephant and muscle tissue from the African savannah elephant) was treated with formaldehyde to cross-link the chromatin and then stored at −80 °C for Hi-C sequencing. The remaining sample was immediately frozen in liquid nitrogen for 30 min and then transferred to a −80 °C freezer for PacBio sequencing, DNBSEQ sequencing and RNA-seq. Sample collection, follow-up experiments and research design in this study were all approved by the Institutional Review Board of BGI (BGI-IRB E22017).

Nucleic acid extraction, library construction and sequencing

Total genomic DNA was extracted using a DNeasy Blood and Tissue Kit (Qiagen, USA) for whole-genome sequencing (WGS) library construction. Total RNA from blood and muscle tissue was extracted using TRIzol reagent (Invitrogen, USA), and cDNA libraries were prepared by reverse transcription of 200–400 bp RNA fragments (Supplementary Table 1). Nucleic acid concentration was measured with a Qubit 2.0 Fluorometer (Life Technologies, USA), and RNA integrity was evaluated using an Agilent 2100 Bioanalyzer System (Agilent, USA). These two types of libraries were subjected to paired-end sequencing on a DNBSEQ-T1 sequencer (MGI Tech, Shenzhen, Guangdong, China). A 15 kb library was constructed from high-quality DNA samples (main band > 30 kb) and sequenced on a PacBio Sequel II platform (Novogene, Tianjin, China). Low-quality reads and adaptor-contaminated reads were removed. Finally, a total of ~100 Gb of clean data were used to assemble the two genomes (Table 1). Cross-linked samples were digested with the DpnII restriction endonuclease for Hi-C library construction and paired-end sequenced on an Illumina HiSeq platform.

Table 1 Sequencing statistics.

Genome assembly

To estimate genome size, a total of ~100 Gb of DNBSEQ short reads were analysed with kmerfreq (v5.0)24. The final estimated genome size was 3.44 Gb for E. maximus and 3.50 Gb for L. africana (Supplementary Fig. 1). The hybrid and haplotype-resolved draft genomes of the two elephants were assembled from Hi-C and PacBio HiFi sequencing data using hifiasm (v0.16.1)25. In the genome polishing stage, minimap2 (v2.17)26 and NextPolish (v1.4.0)27 were used to improve single-base accuracy through three rounds of polishing with HiFi reads and two rounds with DNBSEQ reads. Redundant sequences were removed with Purge_dups (v1.2.5)28. The Burrows-Wheeler Aligner (BWA, v0.7.17) mem algorithm29 was used to map Hi-C sequencing reads to the primary genome. Juicer (v1.5)30 was used for Hi-C data quality control, and the 3D-DNA pipeline (v190716)31 was then used to join and review the scaffolds into chromosome-scale assemblies. Finally, two hybrid genomes composed of 29 pseudo-chromosomes and two sets of haplotigs composed of 28 pseudo-chromosomes were obtained, and the average Hi-C anchoring rate reached 97.28 ± 0.60% (Fig. 1, Supplementary Tables 3, 4). Basic assembly statistics, with scaffold N50 values of 130 Mb and 122 Mb, show a substantial improvement over published elephant genomes (Table 2, Supplementary Table 4)14,15.
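The principle behind k-mer-based genome size estimation can be illustrated with a minimal sketch. It assumes the common approximation that genome size is the total number of k-mer occurrences divided by the depth of the main peak of the k-mer frequency histogram. The function name, the error cutoff `min_depth` and the toy histogram are hypothetical, and this is not the kmerfreq implementation.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer frequency histogram.

    histogram maps a depth (how often a k-mer occurs in the reads) to the
    number of distinct k-mers observed at that depth.
    min_depth is an assumed cutoff below which k-mers are treated as
    sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers are observed.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values, not data from this study).
toy = {1: 1000, 2: 300, 9: 50, 10: 100, 11: 50}
size, peak = estimate_genome_size(toy)
print(size, peak)  # 200.0 10
```

In a strongly heterozygous genome the tallest peak may be the heterozygous one, so the peak is usually inspected visually before the estimate is accepted.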

Fig. 1

Characteristics of the chromosome-scale genomes of the Asian elephant (Elephas maximus) and the African savannah elephant (Loxodonta africana). (a) Circos plot of the genome assemblies. A) Pseudo-chromosomes; B) gene density; C) GC content; D) repeat number; E) sequencing depth (~100 Gb DNBSEQ reads aligned to the genome); F) chromosome synteny (only the longest 25,000 are kept). (b) Hi-C intra-chromosomal contact map of the L. africana haploid genome assembly. (c) Hi-C intra-chromosomal contact map of the E. maximus haploid genome assembly. Hi-C interactions within and among chromosomes were drawn based on the chromatin interaction frequencies between pairs of genomic regions.

Table 2 Comparison of the assembly statistics among the genomes assembled in this study (EmaxG and LafrG) and the previously published elephant genomes19,20.
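For readers unfamiliar with the scaffold N50 statistic used in this comparison, the following minimal sketch shows how it is defined: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy scaffold lengths (hypothetical, in Mb).
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```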

By identifying the sex-determining region of the Y chromosome (SRY) and examining the chromosomal synteny between species using MUMmer (v4.0.0rc1)32, we also discovered two Y-linked regions of ~9 Mb each, which were verified by the DNBSEQ read depth distribution (Supplementary Fig. 2).
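A depth-based check of this kind can be sketched as follows. The sketch assumes a male individual, so that sex-linked sequence is expected at roughly half the autosomal read depth. The function names, the 0.3 to 0.7 acceptance window and the toy depths are hypothetical and are not taken from the analysis in this study.

```python
from statistics import median


def depth_ratio(region_depths, autosome_depths):
    """Median depth of a candidate region relative to the autosomal median."""
    return median(region_depths) / median(autosome_depths)


def looks_hemizygous(ratio, low=0.3, high=0.7):
    """True if the ratio is near 0.5, as expected for single-copy sex-linked sequence.

    The window (low, high) is an assumed tolerance, not a published threshold.
    """
    return low <= ratio <= high


# Toy per-window depths (hypothetical values).
candidate = [14, 15, 16, 15]
autosomes = [29, 30, 31, 30, 32]
ratio = depth_ratio(candidate, autosomes)
print(ratio, looks_hemizygous(ratio))  # 0.5 True
```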

Repeat regions prediction

Transposable elements (TEs) and other repetitive elements were identified using a combination of homology-based and de novo approaches. For the homology-based approach at both the DNA and protein levels, the genome assembly was aligned to the known repeat database REPBASE (v21.01) using RepeatMasker33 (v4.0.5), RepeatProteinMask33 and Tandem Repeats Finder (TRF)34 (v4.07b). For the de novo approach, RepeatModeler35 (v2.0) and LTR_retriever34 were used to construct a de novo repeat library. We found that the Asian elephant and African savannah elephant genomes contained 69.16% and 70.32% TEs, respectively, with the proportions of each type being similar across the two species (Table 3, Supplementary Tables 5, 6). Long interspersed nuclear elements (LINEs) accounted for most TEs, occupying ~54% of the genome. All repetitive elements were masked for gene annotation.
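When annotations from several repeat-finding tools are combined, overlapping hits must be merged before the repeat fraction is calculated, otherwise the same bases are counted more than once. The sketch below illustrates that step under stated assumptions: intervals are 0-based, half-open (start, end) pairs on a single sequence, and the function name and toy coordinates are hypothetical.

```python
def merged_length(intervals):
    """Total number of bases covered by a set of possibly overlapping intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy annotations on a 1,000 bp sequence (hypothetical values).
homology_hits = [(0, 100), (150, 200)]
de_novo_hits = [(50, 120), (400, 500)]
covered = merged_length(homology_hits + de_novo_hits)
print(covered, 100 * covered / 1000)  # 270 27.0
```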

Table 3 Statistics of the repeat elements.

Annotation of protein-coding genes

We combined homology-based, de novo and transcriptome-based methods to predict the gene content of the assemblies. In the homology-based approach, GeneWise36 (v2.4.1) was used to map the protein sequences of 14 closely related species or species with high-quality annotations, namely Homo sapiens, Mus musculus, Suncus etruscus, Equus caballus, Felis catus, Phyllostomus discolor, Sus scrofa, Choloepus didactylus, Dasypus novemcinctus, Trichechus manatus latirostris, Orycteropus afer afer, Elephantulus edwardii, Echinops telfairi, and Chrysochloris asiatica, available in the NCBI database, to the two assembled genomes with an E-value cutoff of 1e-5. In the de novo method, we ran Augustus37 (v3.0.3) on the repeat-masked genome. In the transcriptome-based method, transcripts were assembled using StringTie38 (v1.3.3b) based on clean RNA-seq data. The final protein-coding gene set was generated using the MAKER pipeline39 (v3.01.03) by combining high-quality homology-based, de novo and RNA-seq-supported genes. Based on the above methods, 22,177 genes were annotated in the Asian elephant genome and 22,142 genes in the African savannah elephant genome (Table 4).

Table 4 Protein-coding gene statistics.

Annotation of gene function

Functional annotation of protein-coding genes was carried out using BLAST (E-value cutoff of 1e-5) against publicly available databases, including Swiss-Prot, TrEMBL, Gene Ontology (GO) and KEGG. InterProScan40 (v5.52–86.0) was used to predict domains and motifs. In total, 99.81% of the genes in the gene sets of both elephant species were annotated in the five above-mentioned databases (Fig. 2a,b, Supplementary Table 7). In addition, noncoding RNA (ncRNA) genes, including miRNA, tRNA, snRNA and rRNA, were predicted in the assembled genomes. tRNA genes were identified using tRNAscan-SE41 (v1.3.1). snRNA and miRNA genes were detected by searching the reference genome sequences against the Rfam database (Release 12.0) using BLAST (Supplementary Table 8).
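Annotation rates of this kind are proportions of the gene set with a hit in each database, together with the proportion covered by the union of all databases. The sketch below illustrates the calculation. The function name, the gene identifiers and the key "any_database" are hypothetical, and the toy sets are not data from this study.

```python
def annotation_summary(all_genes, hits_by_db):
    """Percentage of genes annotated per database and in at least one database."""
    all_genes = set(all_genes)
    if not all_genes:
        raise ValueError("gene set is empty")
    summary = {}
    annotated_any = set()
    for db_name, hits in hits_by_db.items():
        # Ignore hits that do not belong to the gene set being summarized.
        hits = set(hits) & all_genes
        summary[db_name] = 100 * len(hits) / len(all_genes)
        annotated_any |= hits
    summary["any_database"] = 100 * len(annotated_any) / len(all_genes)
    return summary


# Toy example with five genes and three databases (hypothetical values).
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = {
    "Swiss-Prot": {"g1", "g2", "g3"},
    "KEGG": {"g2", "g4"},
    "GO": {"g1"},
}
print(annotation_summary(genes, hits))
```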

Fig. 2

Genome Annotation Statistics. (a) Venn diagram of E. maximus gene counts with homology or functional classification by each method. (b) Venn diagram of L. africana gene counts with homology or functional classification by each method. (c) A phylogenetic tree based on single-copy genes from 16 species showing the estimated divergence time (Silhouette from https://www.freevectors.net/free-vectors/animals).

Phylogenetic comparative analysis

We performed a comparative genomic analysis of E. maximus, L. africana and the 14 reference species used in the previous step, with Homo sapiens set as the outgroup. First, the longest transcript of each gene from each species was used to perform an all-to-all BLAST42 (v2.2.26) analysis with the parameters “-p blastp -m8 -e 1e-5 -F F”. Then, genes were clustered using the TreeFam43 (v1.4) pipeline with hierarchical clustering on a sparse graph. Finally, 2,365 single-copy genes were identified (Supplementary Fig. 3). These single-copy genes were used to construct a maximum-likelihood (ML) phylogenetic tree using IQ-TREE44 (v1.6.12), with the best-fit substitution model (GTR + F + R4) selected by ModelFinder45. To estimate divergence times among the 16 species, we used MCMCTree46 (v4.5) implemented in the PAML package. Sequences of the 2,365 single-copy genes were used as the input for MCMCTree, and multiple fossil calibration times were taken from TimeTree (http://www.timetree.org/). The Markov chain Monte Carlo (MCMC) process was run for 1,500,000 iterations with a sampling frequency of 150 after a burn-in of 500,000 iterations (Fig. 2c).
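The selection of single-copy genes from clustered gene families can be sketched as a simple filter: a family qualifies only if every species contributes exactly one gene. The sketch assumes the clustering result is available as a mapping from family identifier to (species, gene) pairs. The function name, identifiers and toy families are hypothetical, and this is not the TreeFam pipeline itself.

```python
from collections import Counter


def single_copy_families(families, species):
    """Return the IDs of families with exactly one gene from every species."""
    wanted = set(species)
    selected = []
    for family_id, members in families.items():
        counts = Counter(sp for sp, _gene in members)
        # Require all species to be present, each with a single gene.
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(family_id)
    return sorted(selected)


# Toy clustering result for three species (hypothetical identifiers).
species = ["Emax", "Lafr", "Hsap"]
families = {
    "fam1": [("Emax", "e1"), ("Lafr", "l1"), ("Hsap", "h1")],
    "fam2": [("Emax", "e2"), ("Emax", "e3"), ("Lafr", "l2"), ("Hsap", "h2")],
    "fam3": [("Emax", "e4"), ("Lafr", "l3")],
}
print(single_copy_families(families, species))  # ['fam1']
```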

Data Records

The chromosome-scale genome sequences of the two elephant species are available at NCBI GenBank under the accession numbers GCA_033060105.1 (EmaxG; ref. 47) and GCA_033060095.1 (LafrG; ref. 48), and the haplotype-resolved genome sequences are also available at NCBI (EmaxH1: GCA_032718755.1, ref. 49; EmaxH2: GCA_032718585.1, ref. 50; LafrH1: GCA_032717405.1, ref. 51; LafrH2: GCA_032717415.1, ref. 52). The annotation files generated in the current study are available in the figshare database53. The raw data that support the findings of this study have been deposited in the National Genomics Data Center (NGDC)54 Genome Sequence Archive (GSA)55 database with the accession number CRA012221 (ref. 56) under the BioProject accession number PRJCA018778. All the above sequencing and analysis data in this study are also available in the CNGB Sequence Archive (CNSA)57 of the China National GeneBank DataBase (CNGBdb)58 with accession number CNP0004258.

Technical Validation

The completeness of the elephant genomes was evaluated by BUSCO59 (v5.2.2) analysis with the mammalia_odb10 data set, with scores of 95.1 ± 1.1% (Table 5). Merqury60 (release 20200430) k-mer analysis and PacBio long-read alignments (genome regions with PacBio long-read coverage over 10× were considered accurately assembled regions61) were used to evaluate the accuracy of the genome assemblies (Table 5, Supplementary Table 9). The completeness of the genome and gene set was also evaluated against the mammalia_odb10 database using BUSCO. The two chromosome-level genomes scored 96.3% and 95.2%, respectively (Supplementary Table 10). The NUCmer program from MUMmer32 (v4.0.0rc1) was used to screen for syntenic blocks, and the identified syntenic blocks were filtered using the delta-filter program from MUMmer32 (v4.0.0rc1) with the parameters “-i 90 -l 5000”, to help illustrate the differences between haplotypes (Supplementary Fig. 4).
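The coverage criterion described above can be illustrated with a minimal sketch that reports the fraction of positions whose long-read depth exceeds a threshold. It assumes per-base (or per-window) depths are already available as a list of numbers and reads "over 10×" as strictly greater than 10. The function name and the toy depths are hypothetical.

```python
def accurate_fraction(depths, min_depth=10):
    """Fraction of positions with read depth strictly greater than min_depth."""
    if not depths:
        raise ValueError("no depth values given")
    supported = sum(1 for d in depths if d > min_depth)
    return supported / len(depths)


# Toy depth profile (hypothetical values).
print(accurate_fraction([0, 5, 10, 11, 25, 30, 40, 12]))  # 0.625
```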

Table 5 Summary of genome quality assessments.