Background & Summary

Coral reefs are among the most diverse and productive ecosystems, supporting more than one-quarter of marine life while covering less than 2% of the ocean floor1. In recent decades, reef-building corals have been threatened by anthropogenic climate change, including ocean warming and acidification2,3, as well as by local stressors such as overfishing, pollution, and coastal development4,5,6. The world has lost almost 50% of its coral cover since the 1950s7. With continued degradation projected, 90% of coral reefs may disappear in the next few decades8,9,10.

The blue corals (Heliopora) are the only genus of octocorals that form a massive hard skeleton and live in symbiosis with zooxanthellae, as scleractinian corals do11 (Fig. 1a). Due to their massive reef structure, blue corals are important reef builders in the Indo-West Pacific11,12,13,14. H. coerulea, with a characteristic blue skeleton, had long been regarded as the only extant member of the family Helioporidae, until the recent description of H. hiberniana (with a white skeleton) from northwestern Australia15. Recent studies of blue corals based on RAD-seq and genotyping-by-sequencing revealed that there are also two distinct lineages of H. coerulea in the Kuroshio Current region16,17. Based on fossil records, the genus Heliopora was once widely distributed throughout the warm shallow oceans in the early Cretaceous11,18 (<120 million years ago, MYA). To date, H. coerulea is distributed in the shallow warm waters of the Indo-Pacific oceans11,17.

Fig. 1

(a) A photograph of the blue coral Heliopora coerulea in the field (Photo credit: Benny K.K. Chan). (b) 21-mer histogram generated using Illumina reads. Genome size and heterozygosity rate were estimated using GenomeScope226.

Heliopora coerulea is known to survive bleaching events better than most scleractinian corals15,19,20. Recently, this species has been reported to have expanded from the tropics to high-latitude Tsukazaki, Japan21. A shift of dominant taxa from scleractinian corals to H. coerulea has been reported in reefs of Ishigaki Island, Japan22 and the South China Sea side of the Philippines14,23. In addition, laboratory experiments showed that H. coerulea had a higher growth rate when exposed to 31 °C – a temperature that would usually trigger the bleaching of scleractinian corals7,8,9 – than at 26 °C24.

To facilitate molecular studies of thermal resistance in blue corals, we report here a draft genome assembly of H. coerulea generated using long-read PacBio HiFi sequencing (Tables 1, 2). The assembled genome of H. coerulea is 429.9 Mb in size, consisting of 769 contigs with an N50 of 1.42 Mb, a GC content of 37.4%, and 55.6% repeat elements (Fig. 2). The genome contains a total of 27,108 protein-coding genes, 95.7% of which were functionally annotated by BLASTp search against published protein databases. In addition, RNA sequencing shows that the H. coerulea genome contains 6,225 lncRNAs and 79 miRNAs.

Table 1 A summary of Heliopora coerulea genome, mRNA, lncRNA, and miRNA sequencing data.
Table 2 Statistics of the assembled genome after filtering by binning, BLAST, and removal of heterozygous contigs.
Fig. 2

Snail plot visualization summarizing metrics of the Heliopora coerulea genome including the length of the longest contig (9.92 Mb; red line), N50 (1.42 Mb; dark orange), base composition, BUSCO completeness, and repeat content.

Methods

Sample collection

The blue coral was collected by SCUBA at 5 m depth from Green Island, Taiwan (22°40′37′′N 121°28′23′′E) in April 2018. Coral fragments were transported in seawater to the Biodiversity Research Center, Academia Sinica, Taipei, where they were kept in a 5 L aerated aquarium. To avoid contamination by bacteria or algae in the water, the coral fragments were rinsed several times in Milli-Q water immediately prior to DNA and RNA sampling. Coral fragments were then fixed in liquid nitrogen for DNA extraction and genome sequencing, whilst tissues were fixed in RNAlater (Invitrogen, CA, USA) for RNA sequencing. All samples were stored in a freezer at −80 °C until extraction.

Genomic sequencing

Genomic DNA was extracted from the coral tissue using the CTAB method25. DNA quality and quantity were measured using agarose gel electrophoresis and a Qubit fluorometer (Thermo Fisher Scientific, MA, USA), respectively. DNA samples were submitted to Novogene (Beijing, China) for library preparation and whole genome sequencing (Table 1). Briefly, 1 µg DNA was used to construct two libraries with 350-bp and 500-bp insert sizes using the NEBNext DNA Library Prep Kit (New England Biolabs, MA, USA), which were sequenced on an Illumina HiSeq X Ten sequencer to generate 122.4 Gb of paired-end reads with a read length of 150 bp. In addition, 10 µg DNA was used to construct a HiFi SMRTbell library using the SMRTbell Express Template Prep Kit 2.0, which was sequenced on a PacBio Sequel II sequencer. A total of 31.8 Gb of high-quality HiFi reads were produced using the circular consensus sequencing (CCS) mode on the PacBio long-read platform.

RNA sequencing

Total RNA was extracted from the coral tissue using TRIzol reagent (Thermo Fisher Scientific, MA, USA) following the manufacturer’s protocol. The quality of the RNA samples was determined with agarose gel electrophoresis and the quantity was determined using a Qubit fluorometer (Thermo Fisher Scientific, MA, USA). RNA samples were submitted to Novogene (Beijing, China) for mRNA, long non-coding RNA (lncRNA), and microRNA (miRNA) sequencing (Table 1). The mRNA library was constructed using the Illumina NEBNext Ultra RNA Library Prep Kit (New England Biolabs, MA, USA) and sequenced on an Illumina HiSeq X Ten sequencer to produce 150-bp paired-end reads. For lncRNA, ribosomal RNA was depleted from total RNA using the Epicentre Ribo-Zero rRNA Removal Kit (Epicentre, WI, USA). The cDNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit (New England Biolabs, MA, USA) and sequenced on an Illumina NovaSeq platform in paired-end mode to produce 150-bp reads. In addition, miRNA libraries were prepared using the NEBNext Multiplex Small RNA Library Prep Kit (Illumina, CA, USA) and sequenced on an Illumina NovaSeq platform to produce 50-bp single-end reads.

Estimation of genome size

The genome size of H. coerulea was estimated using GenomeScope v2.0 with Illumina data26. Adaptors and low-quality reads (quality score <30, length <40 bp) were removed from the Illumina data with Trimmomatic v0.3827. To eliminate zooxanthellae and prokaryotic reads, the Illumina data were further filtered using bbmap.sh v39.01 (https://sourceforge.net/projects/bbmap/) against the Symbiodiniaceae genomes (Symbiodinium minutum, S. microadriaticum, S. kawagutii, and S. goreaui) from the ReefGenomics database (http://reefgenomics.org/) and NCBI Prokaryotic RefSeq genomes with default settings. A total of 88.7 Gb of Illumina reads were retained after quality filtering, of which 77.9 Gb (87.8%) were from the coral host. The clean Illumina data were used to generate a 21-mer histogram using jellyfish v2.2.028, which was then characterized using GenomeScope v2.0. This predicted a genome size of 428.2 Mb and a heterozygosity of 0.73% at a k-mer size of 21 (Fig. 1b).
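The idea behind k-mer based genome size estimation can be illustrated with a much simpler calculation than the one GenomeScope performs. GenomeScope fits a statistical model to the whole histogram; the sketch below only divides the total number of k-mers by the depth of the main peak. It assumes a two-column histogram of (depth, number of distinct k-mers), similar to what a k-mer counter reports. The function name and the error cutoff `min_depth` are hypothetical choices made for illustration, not values from this study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Naive genome size estimate from a k-mer histogram.

    histogram: iterable of (depth, number_of_distinct_kmers) pairs.
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, chosen for illustration only).
    Returns (estimated_size, peak_depth).
    """
    solid = [(d, c) for d, c in histogram if d >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The depth holding the most distinct k-mers is taken as the main peak.
    peak_depth = max(solid, key=lambda dc: dc[1])[0]
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * c for d, c in solid)
    return total_kmers / peak_depth, peak_depth


toy_histogram = [(1, 1000), (2, 200), (9, 10), (10, 100), (11, 10)]
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

In a heterozygous genome the tallest peak may be the heterozygous one at half the homozygous depth, in which case this naive estimate would be roughly doubled; model-based tools are designed to handle that case.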

Genome assembly

De novo assembly of the HiFi reads (N50 of 14.0 kb and mean length of 13.5 kb; Table 1) was performed using nextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo) under default settings. Algal and microbial sequences were removed by binning the genome assembly with MetaBAT2 v2.1529, followed by a BLASTn v2.11.0+ search against the 14 cnidarian genomes in Table 4, four Symbiodiniaceae genomes from the ReefGenomics database (http://reefgenomics.org/), and NCBI Prokaryotic RefSeq genomes with an E-value threshold of 1e-20. The initial assembly generated 1,309.7 Mb of metagenome sequences (Table 2). After binning, a total of 170 bins were identified, and “Bin167”, with 600.2 Mb and >100X coverage of Illumina data, was selected (Table 2 and S1). BLASTn analysis filtered out the potential symbiont sequences and resulted in a 586.0 Mb genome with 2,248 contigs. Possible alternative heterozygous contigs were further eliminated using Purge Haplotigs v1.1.23030 (Table 2). The completeness of the final genome assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.5 against the eukaryota_odb10 and metazoa_odb10 databases under the genome mode31. QUAST v5.2 was used to assess the assembly statistics32. The final assembly is 429.9 Mb in length with an N50 of 1.42 Mb (Table 3; Fig. 2).
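For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed from a list of contig lengths. It is an illustration of the definition only, not the QUAST implementation, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 32, half = 16, reached after the two 8-unit contigs.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```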

Table 3 Genome assembly and annotation statistics of Heliopora coerulea.

In addition, the mitogenome of H. coerulea was assembled from the clean Illumina reads using Norgal v1.0 under the default settings33, and annotated using MITOS2 online34 and a tBLASTn v2.11.0+ search against the published H. coerulea mitochondrial genome (GenBank: OL616236). The H. coerulea mitogenome is 18,957 bp in length with 14 protein-coding genes (Fig. 3), and is 100% identical to OL616236 in GenBank.

Fig. 3

Mitogenome map of Heliopora coerulea. The outer circle shows the genes with the plus strand inside and minus strand outside. The GC content is plotted in the second inner circle at 50-bp sliding windows, depicted in dark blue.

mRNA annotation

The protein coding genes of the H. coerulea genome were predicted using MAKER v3.0 pipeline35 according to Ip et al.36. In brief, repeat contents in the genome were identified using RepeatMasker v4.1.2-p1 (http://www.repeatmasker.org/; settings: “-e rmblast -s -gff”) with RepBase library version 2018102637 and species-specific repeat libraries in RepeatModeler v2.0.338 under the “LTRStruct” option and the default setting for other parameters. A total of 239.1 Mb (55.6%) of the H. coerulea genome consists of repetitive sequences, including 30.6% transposable elements, 21.8% unclassified repeats, and 3.1% simple repeats and low complexity sequences (Table 3 and Fig. 2).

Raw mRNA reads were trimmed using Trimmomatic v0.3827 (quality score <30, length <40 bp). The clean reads were assembled both de novo and in genome-guided mode using Trinity v2.5.139 under the default settings. Cnidarian protein sequences from the UniProt database were used as protein evidence. Augustus v3.440 and SNAP v2006-07-2841 were used for ab initio gene prediction. All predicted gene models were integrated into a consensus weighted annotation with EVidenceModeler v1.1.142 under the default settings in Maker3. In addition, PASA v2.4.1 was used to improve the Maker result using the de novo transcriptome43. Finally, we obtained 27,108 predicted protein-coding genes with an N50 of 1,754 bp (Table 3).

The BUSCO completeness of the predicted gene models was assessed against the eukaryota_odb10 and metazoa_odb10 datasets31 under the protein mode. The predicted genes were functionally annotated using Diamond v2.0.13.151 BLASTp44 against the UniProt and Swiss-Prot databases under the “ultra-sensitive” option and an E-value threshold of 1e-5. Further functional annotation was conducted using eggNOG-mapper v245 for Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and Pfam domains.
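The proportion of functionally annotated genes can be derived from a tabular similarity-search report by counting the genes with at least one hit at or below the E-value threshold. The sketch below assumes the standard 12-column BLAST tabular layout, in which the query identifier is the first column and the E-value is the eleventh; the function name and the toy records are hypothetical and do not reproduce the authors' scripts or data.

```python
def annotated_percent(tabular_lines, gene_ids, max_evalue=1e-5):
    """Percentage of genes with at least one hit at E-value <= max_evalue.

    tabular_lines: lines in 12-column BLAST tabular format
                   (column 1 = query ID, column 11 = E-value).
    gene_ids: all predicted gene identifiers (the denominator).
    """
    genes = set(gene_ids)
    if not genes:
        raise ValueError("gene_ids is empty")
    annotated = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            annotated.add(fields[0])
    return 100.0 * len(annotated & genes) / len(genes)


toy_hits = [
    "g1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
    "g2\tP2\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
    "g3\tP3\t70.0\t80\t10\t0\t1\t80\t1\t80\t1e-5\t90",
]
print(annotated_percent(toy_hits, ["g1", "g2", "g3", "g4"]))
```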

lncRNA annotation

The raw lncRNA reads were filtered to remove adapters and low-quality reads (quality score <30, length <40 bp) using Trimmomatic v0.3827. The clean lncRNA reads were mapped to the H. coerulea genome using HISAT2 v2.1.046 under the default settings. The resulting BAM files were then assembled into transcript models using StringTie v1.3.4d47 under the default settings. The assembled transcripts were processed with FlExible Extraction of LncRNAs (FEELnc) v0.2.148 for lncRNA identification and classification. Briefly, the script FEELnc_filter.pl was used to remove transcripts that had only one exon, were shorter than 200 bp, or overlapped predicted protein-coding regions. The coding potential score of each candidate transcript was calculated using the script FEELnc_codpot.pl under the shuffle mode. Finally, FEELnc_classifier.pl was used to classify potential lncRNAs with respect to the location and direction of transcription of nearby protein-coding genes. A total of 6,225 lncRNA genes were predicted in the H. coerulea genome (Tables S2, S3).
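The three exclusion criteria applied at the filtering step (single exon, length below 200 bp, overlap with protein-coding regions) can be expressed as a simple predicate. The following is a simplified sketch, not FEELnc's implementation: it ignores strand, treats any exon-to-exon overlap as disqualifying, and uses hypothetical class and function names.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    name: str
    contig: str
    exons: list  # list of (start, end) tuples, 1-based inclusive

    @property
    def length(self):
        # Spliced transcript length = sum of exon lengths.
        return sum(end - start + 1 for start, end in self.exons)


def intervals_overlap(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_lncrna_candidate(tx, coding_exons, min_length=200):
    """Apply the three filters described in the text.

    coding_exons: dict mapping contig name -> list of (start, end)
                  intervals of predicted protein-coding exons.
    """
    if len(tx.exons) < 2:        # mono-exonic transcripts are removed
        return False
    if tx.length < min_length:   # transcripts shorter than 200 bp are removed
        return False
    for exon in tx.exons:        # overlap with coding regions is not allowed
        for coding in coding_exons.get(tx.contig, []):
            if intervals_overlap(exon, coding):
                return False
    return True


coding = {"ctg1": [(1000, 1500)]}
candidate = Transcript("tx_a", "ctg1", [(1, 150), (300, 450)])
print(is_lncrna_candidate(candidate, coding))
```

Transcripts passing this kind of filter would still need a coding-potential assessment before being called lncRNAs, as described in the text.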

miRNA annotation

miRNA analysis was conducted according to Ip et al.36. Briefly, raw miRNA reads were trimmed with fastp v0.20.049 under the settings of length_required = 18, max_length = 35, unqualified_percent_limit = 30, n_base_limit = 0. The clean reads were then combined and mapped to the genome using the mapper.pl script in miRDeep2 v2.0.1.250 with bowtie v1.2.251. miRNAs were predicted using the miRDeep2.pl script in miRDeep2 with the Cnidaria mature miRNAs from miRBase v22.152. The predicted miRNAs were retained if they had a miRDeep2 score ≥ 4, star (complementary) and mature read counts ≥ 5, and a significant Randfold p-value. The target genes of miRNAs were predicted using miRanda v3.3a53 with a miRanda score ≥ 140, a dimer binding free energy < −5 kcal mol−1, and strict 5′ seed pairing. In total, we detected 79 miRNA candidates ranging from 20 to 24 nt in length, and 10,636 mRNAs were predicted as their potential targets (Tables S4, S5).
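The retention criteria for predicted miRNAs and their targets amount to a set of threshold checks, sketched below. The record field names are hypothetical and do not correspond to actual miRDeep2 or miRanda output columns. The sketch also assumes that the read-count threshold applies to the star and mature reads separately, which is one reading of the criterion stated above.

```python
def keep_mirna(record, min_score=4.0, min_reads=5):
    """Return True if a predicted miRNA passes the thresholds in the text.

    record: dict with hypothetical keys 'score', 'mature_reads',
            'star_reads' and 'randfold_significant'.
    Assumption: mature and star read counts must EACH be >= min_reads.
    """
    return (
        record["score"] >= min_score
        and record["mature_reads"] >= min_reads
        and record["star_reads"] >= min_reads
        and bool(record["randfold_significant"])
    )


def keep_target(hit, min_score=140.0, max_energy=-5.0):
    """Return True if a miRNA-target hit passes the score and energy cutoffs.

    hit: dict with hypothetical keys 'score' and 'energy' (kcal/mol).
    The free energy must be strictly below max_energy.
    """
    return hit["score"] >= min_score and hit["energy"] < max_energy


example = {"score": 5.2, "mature_reads": 120, "star_reads": 7,
           "randfold_significant": True}
print(keep_mirna(example), keep_target({"score": 152.0, "energy": -18.3}))
```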

Phylogeny, divergence, and gene family analyses

Orthologous groups among H. coerulea and 13 other anthozoans, with Hydra vulgaris as the outgroup species (details in Table 4 and Table S6), were identified using OrthoFinder v2.5.4 under the “diamond_ultra_sens” option54. A total of 407 single-copy genes were aligned using MUSCLE v3.8.3155 and trimmed using TrimAL v1.456. The aligned sequences, with 91,426 amino acid positions and 1.1–13.9% gaps, were concatenated for phylogenetic analysis using the maximum-likelihood method implemented in IQ-TREE v2.1357, with the best-fit model Q.insect + F + I + G4 and 1,000 bootstrap replicates. MCMCtree implemented in PAML v4.9h58 was used to estimate divergence times with a burn-in, sample frequency, and number of samples of 10000000, 1000, and 10000, respectively. The node calibrations among cnidarians were based on fossil records (i.e., ~55 MYA for Acropora59, ~145 MYA for Helioporacea18, ~540 MYA for Hexacorallia60) and the TIMETREE database61 (i.e., 280 – 490 MYA for Edwardsiidae, 520 – 740 MYA for Anthozoa). Using the orthology results, we analyzed gene family expansion and contraction at each node using CAFE v4.262. These analyses revealed that H. coerulea is sister to the soft coral Dendronephthya gigantea, from which it split during the Triassic (~216 MYA, 95% confidence interval of 157–301 MYA; Fig. 4). This D. gigantea + H. coerulea clade is in turn sister to the Hexacorallia clade, consistent with a previous phylogenetic analysis of 234 anthozoans63. Gene family analysis detected 167 expanded and 61 contracted gene families in H. coerulea (Fig. 4; Table S7).
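The concatenation step and the per-taxon gap percentages mentioned above can be illustrated with a small sketch. It assumes each gene alignment is already trimmed and held as a mapping from taxon name to aligned sequence. The function names are hypothetical, and filling missing taxa with gaps is a defensive choice that should not be needed for single-copy orthologs present in every species.

```python
def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts mapping taxon -> aligned sequence; all
                sequences within one dict must have the same length.
    taxa: taxon names to include, in the desired order.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon absent from a gene is padded with gaps.
            parts[taxon].append(alignment.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def gap_percent(sequence):
    """Percentage of alignment columns that are gaps ('-') for one taxon."""
    return 100.0 * sequence.count("-") / len(sequence)


genes = [
    {"taxonA": "MK-L", "taxonB": "MKAL"},
    {"taxonA": "GG", "taxonB": "G-"},
]
matrix = concatenate_alignments(genes, ["taxonA", "taxonB"])
for taxon, seq in matrix.items():
    print(taxon, seq, round(gap_percent(seq), 1))
```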

Table 4 Assembly statistics of 15 cnidarian genomes.
Fig. 4

Maximum-likelihood phylogenomic tree with divergence time of Heliopora coerulea and other cnidarians. Bootstrap support is 100 at all nodes. Each blue line indicates a 95% confidence interval for a divergence time. Numbers on the branch show the lineage-specific expanded (+) and contracted (−) gene families (details in Table S7).

Data Records

The Illumina, PacBio HiFi, and RNA-seq data have been deposited in the NCBI Sequence Read Archive under accession numbers SRR2353002364, SRR2353002465, SRR2353002566, SRR2353002667, SRR2353002768, SRR2353002869, SRR2353002970, SRR2353003071, and SRR2353003172, under BioProject accession number PRJNA936655. The genome assembly has been deposited at GenBank under accession number JASJOG00000000073. The genome annotation (“Hco_maker_PASA_Final.gff”), predicted genes (“Hco_v1.transcript.fasta” and “Hco_v1.protein.fasta”), lncRNAs (“Hco_lncRNA.fasta”), and miRNAs (“Hco_miRNA_mature.fasta”) have been deposited in the Figshare database74.

Technical Validation

The quality of the H. coerulea genome assembly was assessed by several approaches: (i) comparison with the estimated genome size, which is also ~430 Mb in total length (Figs. 1b, 2); (ii) assembly of the complete mitogenome, which is 100% identical in size and gene order to a published mitogenome of the same species (GenBank: OL616236; Fig. 3); (iii) QUAST analysis, which showed that the assembly statistics of H. coerulea are comparable with those of published cnidarian genomes (Table 4); (iv) BUSCO analysis, which identified 98.4% of eukaryotic BUSCOs and 94.4% of metazoan BUSCOs in the H. coerulea genome, and 98.4% of eukaryotic BUSCOs and 95.3% of metazoan BUSCOs in its predicted gene models (Table 4); (v) analysis of genome coverage using SAMtools v1.15.175, which showed 100% genome coverage and a 91.4% mapping rate for PacBio HiFi reads, and 94.8% genome coverage and an 88.4% mapping rate for Illumina short reads (Table 3). These results indicate that the H. coerulea assembly is of high quality.
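The two read-mapping metrics reported above can be defined as follows: genome coverage is the percentage of assembly positions covered by at least one read, and mapping rate is the percentage of reads that align to the assembly. The source does not state which SAMtools subcommands or options were used, so the sketch below only illustrates the definitions on plain Python values; the function names and inputs are hypothetical.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Percentage of positions with read depth >= min_depth.

    depths: per-base read depths across the assembly.
    """
    if not depths:
        raise ValueError("no depth values given")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


def mapping_rate(mapped_reads, total_reads):
    """Percentage of reads that mapped to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


# Toy example: 3 of 5 positions are covered; 914 of 1000 reads mapped.
print(breadth_of_coverage([0, 3, 5, 0, 1]))
print(mapping_rate(914, 1000))
```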